Predicting Whether Job Postings Are Fraudulent

Executive Summary

Step 0: What’s the point?

Situation:

The rise of ghost and fraudulent job postings has become a major problem for job platforms, with recent industry reports estimating that up to 20–30% of online job ads show signs of suspicious or deceptive activity. These posts create harmful experiences for users: wasting applicants’ time, exposing them to phishing attempts, and eroding trust in the platform. They also hurt employers, who depend on credible marketplaces to attract qualified candidates. For job-posting websites like LinkedIn or Indeed, the challenge is scale: millions of posts go live every month, making manual review impossible. Predictive analytics offers a way to proactively identify high-risk postings by learning patterns that distinguish legitimate jobs from fraudulent or “ghost” listings (those that are never actually reviewed or filled).

Data Issues:

Goal:

Models:

We will create logistic regression, decision tree, SVM, random forest, KNN, and ANN models. We will also create a stacked model that combines these individual models, using a decision tree as the second-level model on top of them. We want a decision tree because it combines the models without simply averaging them. An average gives every model the same weight, so the result tends to land between the individual models rather than reliably beating the best one. With the decision tree, the goal is to take the best of each model to produce a superior model that outperforms the individual models.
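
As a rough illustration of the stacking idea described above, the sketch below trains two base models on synthetic data and then fits a decision tree on their predicted probabilities. The toy data, the three-way split, and every object name here are assumptions made for illustration only; the project's own base models and data differ.

```r
library(rpart)  # recommended package that ships with standard R installs

set.seed(1)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(as.integer(toy$x1 + toy$x2^2 + rnorm(n, sd = 0.5) > 1))

# Three-way split: level-1 training, level-2 training, final test
idx  <- sample(rep(1:3, length.out = n))
base <- toy[idx == 1, ]
meta <- toy[idx == 2, ]
test <- toy[idx == 3, ]

# Level 1: two illustrative base models
m_glm  <- glm(y ~ x1 + x2, data = base, family = binomial)
m_tree <- rpart(y ~ x1 + x2, data = base, method = "class")

# Collect each base model's predicted probability of class "1"
level1_preds <- function(d) {
  data.frame(
    p_glm  = predict(m_glm, d, type = "response"),
    p_tree = predict(m_tree, d, type = "prob")[, "1"],
    y      = d$y
  )
}

# Level 2: a decision tree trained on the base models' predictions
stack_model <- rpart(y ~ p_glm + p_tree, data = level1_preds(meta),
                     method = "class")
stack_pred  <- predict(stack_model, level1_preds(test), type = "class")
table(predicted = stack_pred, actual = test$y)
```

Fitting the second-level tree on rows the base models never saw keeps it from learning their training-set overconfidence.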

Outcome:

By building and comparing models such as logistic regression, decision tree, SVM, random forest, KNN, ANN, and stacked models, we can evaluate which approach most effectively reduces false negatives, because missing a fraudulent post is far more damaging than mistakenly flagging a real one. Ultimately, the goal of this project is to design a model that improves platform safety, protects job seekers, and strengthens the integrity of online hiring ecosystems.
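
Since false negatives are the error this project cares most about, it may help to see how they can be read off a confusion matrix. The sketch below uses made-up labels and predictions (not project results) to compute sensitivity and the false negative rate.

```r
# Hypothetical labels: 1 = fraudulent, 0 = legitimate
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0)
predicted <- c(1, 0, 0, 0, 0, 1, 1, 0)

# Fix the factor levels so the table always has both rows and columns
cm <- table(
  predicted = factor(predicted, levels = c(0, 1)),
  actual    = factor(actual,    levels = c(0, 1))
)

tp <- cm["1", "1"]  # fraudulent posts that were flagged
fn <- cm["0", "1"]  # fraudulent posts that were missed

sensitivity         <- tp / (tp + fn)  # share of fraud that was caught
false_negative_rate <- fn / (tp + fn)  # share of fraud that slipped through
```

Comparing models on sensitivity rather than on overall accuracy matters here because only about 5% of the postings are fraudulent, so a model that flags nothing would still look accurate.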

Step 1: Load Data

We need to load the data into an object (job) so that we can interact with it. We do not set stringsAsFactors to TRUE, because some string columns (such as the free-text fields) should not be converted to factors and need separate handling while cleaning the data.

# Let's store the data in the job object so we can interact with it
job <- read.csv("fake_job_postings.csv") 

Step 2: Clean Data

We need to get a sense of the data before we can start cleaning it. It is important to remove any columns that are unnecessary or that we should not use in our prediction models (e.g., data that would not be available when making predictions). It is also a good time to deal with NA data (if there are any). Factors with too many levels (>40) do not behave well in certain models, so consolidating them here is important. For the KNN and ANN models, it is also important to dummify and scale the data. So, we will use job for the logistic regression, SVM, and random forest models, job_dummy for the decision tree model (the non-dummified version will not produce a model based on our data), and job_scaled for the KNN and ANN models. After cleaning the data, it is good to double-check it to ensure the desired results are achieved.
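
To make "dummify and scale" concrete, here is a minimal sketch on a toy data frame. The use of model.matrix() and min-max scaling is an assumption for illustration; the toy_* objects and the minmax() helper are hypothetical names, not the ones used to build job_dummy and job_scaled.

```r
toy <- data.frame(
  employment_type = factor(c("Full-time", "Part-time", "Contract", "Full-time")),
  description     = c(905, 2077, 355, 2600),  # character counts
  fraudulent      = c(0, 0, 1, 0)
)

# With a single factor and no intercept ("- 1"), every level gets its own 0/1 column
toy_dummy <- as.data.frame(model.matrix(~ . - 1, data = toy))

# Min-max scaling to [0, 1]; constant columns map to 0 to avoid dividing by zero
minmax <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}
toy_scaled <- as.data.frame(lapply(toy_dummy, minmax))
```

Scaling matters for KNN and ANN because a raw character count in the thousands would otherwise dominate the 0/1 columns in any distance or gradient computation.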

Explore Data

str(job) # Let's get a sense of columns and data types
## 'data.frame':    17880 obs. of  18 variables:
##  $ job_id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title              : chr  "Marketing Intern" "Customer Service - Cloud Video Production" "Commissioning Machinery Assistant (CMA)" "Account Executive - Washington DC" ...
##  $ location           : chr  "US, NY, New York" "NZ, , Auckland" "US, IA, Wever" "US, DC, Washington" ...
##  $ department         : chr  "Marketing" "Success" "" "Sales" ...
##  $ salary_range       : chr  "" "" "" "" ...
##  $ company_profile    : chr  "We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celeb"| __truncated__ "90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service e"| __truncated__ "Valor Services provides Workforce Solutions that meet the needs of companies across the Private Sector, with a "| __truncated__ "Our passion for improving quality of life through geography is at the heart of everything we do.  Esri’s geogra"| __truncated__ ...
##  $ description        : chr  "Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hu"| __truncated__ "Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe "| __truncated__ "Our client, located in Houston, is actively seeking an experienced Commissioning Machinery Assistant that posse"| __truncated__ "THE COMPANY: ESRI – Environmental Systems Research InstituteOur passion for improving quality of life through g"| __truncated__ ...
##  $ requirements       : chr  "Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editoria"| __truncated__ "What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and fre"| __truncated__ "Implement pre-commissioning and commissioning procedures for rotary equipment.Execute all activities with subco"| __truncated__ "EDUCATION: Bachelor’s or Master’s in GIS, business administration, or a related field, or equivalent work exper"| __truncated__ ...
##  $ benefits           : chr  "" "What you will get from usThrough being part of the 90 Seconds team you will gain:experience working on projects"| __truncated__ "" "Our culture is anything but corporate—we have a collaborative, creative environment; phone directories organize"| __truncated__ ...
##  $ telecommuting      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_logo   : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_questions      : int  0 0 0 0 1 0 1 1 1 0 ...
##  $ employment_type    : chr  "Other" "Full-time" "" "Full-time" ...
##  $ required_experience: chr  "Internship" "Not Applicable" "" "Mid-Senior level" ...
##  $ required_education : chr  "" "" "" "Bachelor's Degree" ...
##  $ industry           : chr  "" "Marketing and Advertising" "" "Computer Software" ...
##  $ function.          : chr  "Marketing" "Customer Service" "" "Sales" ...
##  $ fraudulent         : int  0 0 0 0 0 0 0 0 0 0 ...
summary(job)
##      job_id         title             location          department       
##  Min.   :    1   Length:17880       Length:17880       Length:17880      
##  1st Qu.: 4471   Class :character   Class :character   Class :character  
##  Median : 8940   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8940                                                           
##  3rd Qu.:13410                                                           
##  Max.   :17880                                                           
##  salary_range       company_profile    description        requirements      
##  Length:17880       Length:17880       Length:17880       Length:17880      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    benefits         telecommuting    has_company_logo has_questions   
##  Length:17880       Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :1.0000   Median :0.0000  
##                     Mean   :0.0429   Mean   :0.7953   Mean   :0.4917  
##                     3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  employment_type    required_experience required_education   industry        
##  Length:17880       Length:17880        Length:17880       Length:17880      
##  Class :character   Class :character    Class :character   Class :character  
##  Mode  :character   Mode  :character    Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##   function.           fraudulent     
##  Length:17880       Min.   :0.00000  
##  Class :character   1st Qu.:0.00000  
##  Mode  :character   Median :0.00000  
##                     Mean   :0.04843  
##                     3rd Qu.:0.00000  
##                     Max.   :1.00000
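
Because summary() reports only length and class for character columns, blank strings are easy to overlook. A small sketch like the following, run here on a hypothetical toy frame rather than on job itself, shows one way to measure the share of blanks per text column and the class balance.

```r
toy <- data.frame(
  department   = c("Marketing", "", "Sales", ""),
  salary_range = c("", "", "20000-28000", ""),
  fraudulent   = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Share of empty strings in each character column
is_chr      <- vapply(toy, is.character, logical(1))
blank_share <- colMeans(toy[is_chr] == "")

# Share of rows labelled fraudulent (the positive class)
fraud_share <- mean(toy$fraudulent == 1)
```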

Modify Data (Change and Delete)

We want to remove columns that are irrelevant or that we will not have access to when using this model to predict the fraudulent outcome status. We also want to be careful about how we modify data as changing data into factors or other simplifications can result in information being lost (which will hurt the robustness of our models).

job$job_id <- NULL # This is not necessary, so let's delete it

# All of these should be treated like factors instead of strings
job$location <- as.factor(job$location)
job$department <- as.factor(job$department)
job$salary_range <- as.factor(job$salary_range)
job$employment_type <- as.factor(job$employment_type)
job$required_experience <- as.factor(job$required_experience)
job$required_education <- as.factor(job$required_education)
job$industry <- as.factor(job$industry)
job$function. <- as.factor(job$function.)

Benefits Cleaning

Here we parse the text in the benefits column and transform it into binary variables.

# Create binary flags for top 5 signals in benefits
job$benefits_pipe <- grepl("\\|", job$benefits)       # pipe symbol
job$benefits_hash <- grepl("#", job$benefits)         # hash symbol
job$benefits_bonus <- grepl("bonus", job$benefits, ignore.case = TRUE)  # keyword: bonus
job$benefits_apply <- grepl("apply|contact", job$benefits, ignore.case = TRUE)  # keywords: apply/contact
job$benefits_benefits <- grepl("benefits", job$benefits, ignore.case = TRUE)  # keyword: benefits

# Convert to numeric 0/1 if needed
job[, c("benefits_pipe", "benefits_hash", "benefits_bonus",
        "benefits_apply", "benefits_benefits")] <- 
  lapply(job[, c("benefits_pipe", "benefits_hash", "benefits_bonus",
                 "benefits_apply", "benefits_benefits")], as.numeric)

job$benefits <- nchar(job$benefits) # Number of characters (length of benefit section) might be beneficial to our prediction model
job$benefits <- ifelse(is.na(job$benefits), mean(job$benefits, na.rm = T), job$benefits) # deal with NAs by replacing with mean value
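
The sketch below walks through the same three steps (flag, length, mean imputation) on a hypothetical three-element vector, mainly to show how blank and missing strings behave differently: a blank has length 0, while a missing string has length NA and is the only case the imputation touches.

```r
benefits <- c("Health insurance | 401k", "", NA_character_)

# grepl() returns FALSE for missing strings, so the flag is never NA
has_pipe <- as.numeric(grepl("\\|", benefits))

# nchar() gives 0 for a blank string and NA for a missing one
len <- nchar(benefits)

# Replace missing lengths with the mean of the observed lengths
len_imputed <- ifelse(is.na(len), mean(len, na.rm = TRUE), len)
```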

Title Cleaning

Here we parse the text in the title column and transform it into binary variables.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

job <- job %>%
  mutate(
    slash_present      = if_else(str_detect(title, "/"), 1, 0),     # slash
    backslash_present  = if_else(str_detect(title, "\\\\"), 1, 0), # backslash
    amp_present        = if_else(str_detect(title, "&"), 1, 0),    # ampersand
    exclam_present     = if_else(str_detect(title, "!"), 1, 0),    # exclamation
    dash_present       = if_else(str_detect(title, "-"), 1, 0),    # dash/hyphen
    multiple_spaces    = if_else(str_detect(title, " {2,}"), 1, 0),# double spaces
    parens_present     = if_else(str_detect(title, "\\(|\\)"), 1, 0), # parentheses
    numbers_present    = if_else(str_detect(title, "[0-9]"), 1, 0) # any digits
  )


job$title <- nchar(job$title) # Number of characters (length of title section) might be beneficial to our prediction model

Requirements Cleaning

Here we parse the text in the requirements column and transform it into binary variables.

# Make everything lowercase once for speed
job <- job %>%
  mutate(req_clean = tolower(requirements))

# 1. Requirements missing or very short (< 10 words)
job$req_missing_or_short <- as.integer(
  is.na(job$req_clean) | str_count(job$req_clean, "\\w+") < 10
)

# 2. Heavy engineering / industrial terms
eng_terms <- c(
  "asme", "api", "ansi", "pressure vessel", "heat exchanger", 
  "pumps", "compressor", "valve", "kilovolt", "kv", 
  "scada", "plc", "p&id", "process hazard", "piping"
)

job$has_heavy_engineering_terms <- as.integer(
  str_detect(job$req_clean, str_c(eng_terms, collapse = "|"))
)

# 3. Certifications / accreditations
cert_terms <- c(
  "pmp", "pe", "certified", "license", "licence",
  "six sigma", "cfa", "osha", "hazwoper"
)

job$has_certification_terms <- as.integer(
  str_detect(job$req_clean, str_c(cert_terms, collapse = "|"))
)

# 4. Years of experience (a number followed by "year(s)", e.g. "1+ years", "2 years", "5-10 years")
job$has_years_experience <- as.integer(
  str_detect(job$req_clean, "\\d+\\+?\\s*years?")
)

# 5. Degree required
job$has_degree_required <- as.integer(
  str_detect(job$req_clean, "bachelor|degree|required degree|bs |ms |mba")
)

# 6. Software / tools
tool_terms <- c(
  "ms office", "excel", "word", "powerpoint",
  "sap", "quickbooks", "primavera", "autocad"
)

job$has_tool_software_terms <- as.integer(
  str_detect(job$req_clean, str_c(tool_terms, collapse = "|"))
)

# 7. Safety / regulatory language
safety_terms <- c(
  "osha", "compliance", "safety procedures", "audit",
  "regulations", "hazmat", "hazard"
)

job$has_safety_regulation_terms <- as.integer(
  str_detect(job$req_clean, str_c(safety_terms, collapse = "|"))
)

# 8. Heavy bullet lists / long enumerations
job$req_contains_heavy_lists <- as.integer(
  str_count(job$req_clean, "- |•|\\*|\\n") > 10
)

# 9. Title–requirement mismatch flag
# (Engineering terms inside non-engineering jobs)
job$req_title_mismatch <- with(job, as.integer(
  str_detect(req_clean, str_c(eng_terms, collapse = "|")) &
    !str_detect(tolower(title), "engineer|technician|operator|mechanic")
))

job$req_clean <- NULL # No longer need this
job$requirements <- nchar(job$requirements) # Number of characters (length of requirement section) might be beneficial to our prediction model
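
One caveat with keyword patterns like those above is that short terms match inside longer words: "pe" occurs in "experience", and "bs " occurs in "jobs ". The sketch below compares patterns in that style with word-boundary variants on three made-up requirement strings. The stricter patterns are an assumed alternative, shown only for comparison.

```r
req <- c(
  "5+ years of experience in people management",
  "pe license or pmp certified",
  "open jobs in growing teams "
)

# Patterns in the style used above: plain alternation, no word boundaries
cert_loose   <- "pmp|pe|certified"
degree_loose <- "bachelor|degree|bs |ms |mba"

# Assumed stricter variants: \\b anchors each term at word boundaries
cert_strict   <- "\\b(pmp|pe|certified)\\b"
degree_strict <- "\\b(bachelors?|degrees?|bs|ms|mba)\\b"

cert_hits <- data.frame(
  loose  = grepl(cert_loose,  req),
  strict = grepl(cert_strict, req, perl = TRUE)
)
degree_hits <- data.frame(
  loose  = grepl(degree_loose,  req),
  strict = grepl(degree_strict, req, perl = TRUE)
)
```

Here the loose certification pattern flags all three strings, while the bounded one flags only the second. Whether the tighter version improves the models is an empirical question, and switching to it would change the feature values reported in this document.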

Description Cleaning

Here we parse the text in the description column and transform it into binary variables.

# Create 9 binary feature columns from job$description

job$has_urgent_language <- as.integer(grepl(
  "urgent|apply now|immediate start|asap|start immediately",
  job$description, ignore.case = TRUE))

job$has_no_experience_needed <- as.integer(grepl(
  "no experience|training provided|any background|anyone can apply",
  job$description, ignore.case = TRUE))

job$has_salary_info <- as.integer(grepl(
  "\\$|salary|per hour|per annum|k",   # salary cues; the bare "k" matches any text containing that letter
  job$description, ignore.case = TRUE))

job$has_qualification_terms <- as.integer(grepl(
  "bachelor|degree|certificate|qualification|experience required",
  job$description, ignore.case = TRUE))

job$has_benefits_stated <- as.integer(grepl(
  "benefits|health insurance|401k|superannuation|paid time off|leave",
  job$description, ignore.case = TRUE))

job$has_technical_terms <- as.integer(grepl(
  "sql|python|excel|jira|crm|compliance|financial analysis",
  job$description, ignore.case = TRUE))

job$has_contact_number_or_whatsapp <- as.integer(grepl(
  "\\b\\d{3}[- ]?\\d{3}[- ]?\\d{4}\\b|whatsapp",
  job$description, ignore.case = TRUE))

job$has_company_language <- as.integer(grepl(
  "team|mission|vision|culture|values|our company",
  job$description, ignore.case = TRUE))

job$has_commission_only_language <- as.integer(grepl(
  "commission only|high earning|earn up to|unlimited income",
  job$description, ignore.case = TRUE))

job$description <- nchar(job$description) # Number of characters (length of description section) might be beneficial to our prediction model
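
A flag is only informative if it varies across postings, so it can be worth checking what a pattern actually matches. The sketch below applies the salary pattern from above to four invented descriptions and compares it with a hypothetical tighter pattern that requires digits before the "k".

```r
desc <- c(
  "Join our marketing team",
  "Salary: $40,000 per annum",
  "Earn 50k in your first year",
  "We build apps"
)

# Pattern as written above: the last alternative is a bare "k"
salary_loose <- "\\$|salary|per hour|per annum|k"

# Assumed alternative: only treat "k" as a salary cue when it follows digits
salary_strict <- "\\$|salary|per hour|per annum|\\b\\d+k\\b"

loose_hits  <- grepl(salary_loose,  desc, ignore.case = TRUE)
strict_hits <- grepl(salary_strict, desc, ignore.case = TRUE, perl = TRUE)
```

The first description is flagged by the loose pattern only because "marketing" contains a "k". On real descriptions, which run to hundreds of characters, a pattern with a bare letter is likely to match almost every row.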

Company Profile Cleaning

Here we parse the text in the company profile column and transform it into binary variables.

library(dplyr)
library(stringr)

job <- job %>%
  mutate(
    has_referral_bonus = ifelse(str_detect(tolower(company_profile), "referral bonus|bonus for referral"), 1, 0),
    has_signing_bonus  = ifelse(str_detect(tolower(company_profile), "signing bonus|bonus by"), 1, 0),
    has_perks          = ifelse(str_detect(tolower(company_profile), "perks|corporate discounts|benefits"), 1, 0),
    has_relocation     = ifelse(str_detect(tolower(company_profile), "relocation|out of town candidates|move assistance"), 1, 0)
  )

# Preview
#job %>% select(company_profile, has_referral_bonus, has_signing_bonus, has_perks, has_relocation) %>% head()
job$company_profile <- nchar(job$company_profile) # Number of characters (length of company profile section) might be beneficial to our prediction model

Salary Range Cleaning

# With 84% of the values being null, we figured it best to transform the data into binary values of whether salary is known or not. It did not make sense to try to replace NA values with the mean since we only have values for 16% of the data. 
# Count blanks
sum(job$salary_range == "", na.rm = TRUE)
## [1] 15012
job$salary_known <- ifelse(is.na(job$salary_range) | job$salary_range == "", 0, 1)
job$salary_known <- factor(job$salary_known)
table(job$salary_known)
## 
##     0     1 
## 15012  2868

Department Cleaning

# 1️⃣ Convert to character, trim, lowercase
job$department_clean <- tolower(trimws(as.character(job$department)))

# 2️⃣ Replace blanks or NAs with "NA"
job$department_clean[job$department_clean == "" | is.na(job$department_clean)] <- "NA"

# 3️⃣ Merge obvious duplicates / similar departments
merge_list <- list(
  "customer service" = c("customer service", "customer service ", "cs", "csd relay"),
  "it"               = c("it", "information technology", "it services"),
  "marketing"        = c("marketing", "performance marketing"),
  "sales"            = c("sales", "sales and marketing"),
  "administration"   = c("admin", "administrative", "administration", "administration support", "admin/clerical", "admin - clerical"),
  "accounting"       = c("accounting", "accounting/finance", "accounting and finance", "accounting & finance", "accounting/payroll"),
  "engineering"      = c("engineering", "engineering "),
  "hr"               = c("hr", "human resources"),
  "product"          = c("product", "product development", "product innovation", "product team"),
  "operations"       = c("operations", "oil & energy", "oil and gas", "maintenance"),
  "customer_facing"        = c("client services", "customer success", "customer support", "content", "creative services"),
  "business_management"    = c("business", "business development", "management", "project management"),
  "tech_development"       = c("software development", ".net", ".net development", "tech", "technical", "technical support", "design", "development"),
  "education_training"     = c("didactics", "education", "editorial"),
  "operations_logistics"   = c("warehouse", "voyageur medical transportation")
)

# Apply merges
for (new_name in names(merge_list)) {
  job$department_clean[job$department_clean %in% merge_list[[new_name]]] <- new_name
}

# 4️⃣ Group rare departments (≤10 occurrences) into "other"
dept_counts <- table(job$department_clean)
job$department_clean <- ifelse(dept_counts[job$department_clean] > 10,
                               job$department_clean,
                               "other")

# 5️⃣ Convert to factor
job$department_clean <- factor(job$department_clean)

# Check resulting counts
table(job$department_clean)
## 
##   account management           accounting       administration 
##                   13                   48                   95 
##                  all           art studio  business_management 
##                   16                   11                   82 
##             clerical           commercial             creative 
##                   27                   18                   48 
##     customer service      customer_facing           department 
##                  135                  119                   23 
##              digital   education_training           engagement 
##                   14                   58                   13 
##          engineering              finance                   hr 
##                  512                   74                   86 
## international growth                   it                legal 
##                   17                  355                   24 
##            marketing        merchandising                   NA 
##                  443                   11                11553 
##           operations operations_logistics                other 
##                  366                   28                 2252 
##            permanent              product           production 
##                   13                  185                   33 
##                   qa                  r&d               retail 
##                   18                   55                   46 
##                sales                squiz              support 
##                  594                   20                   19 
##     tech_development           technology 
##                  377                   79
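
The "keep levels with more than 10 rows, lump the rest" step recurs for department, industry, function, and country. One way to avoid repeating it is a small helper such as the sketch below. lump_rare() is a hypothetical function written for illustration and is not part of the original code.

```r
# Replace every value that occurs min_count times or fewer with `other`
lump_rare <- function(x, min_count = 10, other = "other") {
  x      <- as.character(x)
  counts <- table(x)
  keep   <- names(counts)[counts > min_count]
  factor(ifelse(x %in% keep, x, other))
}

# Toy input: two frequent departments and two rare ones
dept   <- c(rep("sales", 12), rep("it", 11), rep("squiz", 3), "legal")
lumped <- lump_rare(dept)
table(lumped)
```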

Industry Cleaning

# Convert to character and trim whitespace
job$industry_clean <- trimws(as.character(job$industry))

# Replace blanks or NA with "NA"
job$industry_clean[job$industry_clean == "" | is.na(job$industry_clean)] <- "NA"

# Create industry groupings
industry_map <- list(
  "Technology & Software" = c(
    "Computer Software", "Information Technology and Services", "Internet",
    "Computer Games", "Computer Hardware", "Computer Networking", 
    "Computer & Network Security", "Semiconductors", "Information Services",
    "Program Development", "Nanotechnology"
  ),
  "Healthcare, Wellness & Life Sciences" = c(
    "Healthcare, Wellness & Life Sciences", "Hospital & Health Care",
    "Medical Practice", "Mental Health Care", "Health, Wellness and Fitness",
    "Pharmaceuticals", "Biotechnology", "Medical Devices", "Veterinary"
  ),
  "Finance, Banking & Insurance" = c(
    "Financial Services", "Banking", "Insurance", "Investment Management",
    "Venture Capital & Private Equity", "Capital Markets", "Investment Banking",
    "Accounting"
  ),
  "Business Administration" = c(
    "Staffing and Recruiting", "Human Resources", "Executive Office"
  ),
  "Consulting, Professional Services & Legal" = c(
    "Management Consulting", "Legal Services", "Law Practice", "Government Relations",
    "Alternative Dispute Resolution", "Individual & Family Services"
  ),
  "Consumer Goods, Retail & Fashion" = c(
    "Consumer Goods", "Consumer Services", "Retail", "Apparel & Fashion",
    "Cosmetics", "Sporting Goods", "Luxury Goods & Jewelry", "Textiles",
    "Furniture", "Consumer Electronics", "Wholesale"
  ),
  "Media, Entertainment & Creative" = c(
    "Public Relations and Communications", "Media Production", "Broadcast Media",
    "Publishing", "Music", "Entertainment", "Animation", "Graphic Design",
    "Design", "Photography", "Writing and Editing", "Motion Pictures and Film",
    "Market Research", "Online Media", "Performing Arts", "Sports",
    "Marketing and Advertising"
  ),
  "Hospitality, Travel & Leisure" = c(
    "Hospitality", "Leisure, Travel & Tourism", "Restaurants", "Gambling & Casinos",
    "Airlines/Aviation", "Events Services", "Facilities Services"
  ),
  "Education & Training" = c(
    "Education Management", "E-Learning", "Primary/Secondary Education", 
    "Higher Education", "Professional Training & Coaching", "Libraries",
    "Museums and Institutions", "Translation and Localization", "Research"
  ),
  "Manufacturing & Industrial" = c(
    "Electrical/Electronic Manufacturing", "Mechanical or Industrial Engineering",
    "Industrial Automation", "Machinery", "Chemicals", "Plastics",
    "Printing", "Packaging and Containers", "Shipbuilding", "Civil Engineering",
    "Automotive", "Business Supplies and Equipment"
  ),
  "Energy, Utilities & Environment" = c(
    "Oil & Energy", "Renewables & Environment", "Utilities", "Environmental Services",
    "Mining & Metals", "Wireless", "Telecommunications"
  ),
  "Transportation, Logistics & Supply Chain" = c(
    "Logistics and Supply Chain", "Warehousing", "Transportation/Trucking/Railroad",
    "Package/Freight Delivery", "Maritime", "Import and Export",
    "International Trade and Development", "Outsourcing/Offshoring"
  ),
  "Agriculture, Food & Natural Resources" = c(
    "Food & Beverages", "Food Production", "Farming", "Fishery", "Ranching",
    "Wine and Spirits"
  ),
  "Real Estate & Construction" = c(
    "Construction", "Real Estate", "Commercial Real Estate", "Building Materials",
    "Architecture & Planning"
  ),
  "Government, Nonprofit & Public Sector" = c(
    "Government Administration", "Nonprofit Organization Management",
    "Civic & Social Organization", "Public Policy", "Public Safety",
    "Law Enforcement", "Philanthropy", "Fund-Raising", "Religious Institutions"
  ),
  "Defense, Security & Aerospace" = c(
    "Defense & Space", "Military", "Security and Investigations", "Aviation & Aerospace"
  )
)

# Apply the mapping
for (group in names(industry_map)) {
  job$industry_clean[job$industry_clean %in% industry_map[[group]]] <- group
}

# Optional: group any remaining very rare industries (≤10 occurrences) into "other"
industry_counts <- table(job$industry_clean)
job$industry_clean <- ifelse(industry_counts[job$industry_clean] > 10,
                             job$industry_clean,
                             "other")

# Convert to factor
job$industry_clean <- factor(job$industry_clean)

# Check resulting counts
table(job$industry_clean)
## 
##     Agriculture, Food & Natural Resources 
##                                       146 
##                   Business Administration 
##                                       243 
## Consulting, Professional Services & Legal 
##                                       267 
##          Consumer Goods, Retail & Fashion 
##                                       889 
##             Defense, Security & Aerospace 
##                                        65 
##                      Education & Training 
##                                      1034 
##           Energy, Utilities & Environment 
##                                       727 
##              Finance, Banking & Insurance 
##                                      1189 
##     Government, Nonprofit & Public Sector 
##                                       199 
##      Healthcare, Wellness & Life Sciences 
##                                       814 
##             Hospitality, Travel & Leisure 
##                                       465 
##                Manufacturing & Industrial 
##                                       336 
##           Media, Entertainment & Creative 
##                                      1486 
##                                        NA 
##                                      4903 
##                Real Estate & Construction 
##                                       425 
##                     Technology & Software 
##                                      4436 
##  Transportation, Logistics & Supply Chain 
##                                       256
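
The grouping loop above can also be expressed as a single named-vector lookup, which some readers find easier to audit. The sketch below uses a deliberately tiny, hypothetical two-group map, not the full industry_map, to show the idea.

```r
small_map <- list(
  "Technology & Software"        = c("Computer Software", "Internet"),
  "Finance, Banking & Insurance" = c("Banking", "Insurance")
)

# Invert the list: names are raw industries, values are their groups
lookup <- setNames(
  rep(names(small_map), lengths(small_map)),
  unlist(small_map, use.names = FALSE)
)

industry <- c("Internet", "Banking", "Farming", "Computer Software")

# Look up each industry; anything not in the map keeps its original label
grouped <- unname(lookup[industry])
grouped[is.na(grouped)] <- industry[is.na(grouped)]
```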

Function Cleaning

# Convert to character and trim whitespace
job$function_clean <- trimws(as.character(job$function.))  # read.csv named this column 'function.'

# Replace blanks or NA with "NA"
job$function_clean[job$function_clean == "" | is.na(job$function_clean)] <- "NA"

# Create function groupings
function_map <- list(
  "Marketing & Advertising" = c("Marketing & Advertising", "Marketing", "Advertising", "Public Relations"),
  "Analytics & Business Development" = c("Business Development", "Data Analyst", "Business Analyst"),
  "Sales & Customer Service & IT" = c("Customer Service", "Sales", "Information Technology"),
  "Management & Leadership" = c("Management", "Strategy/Planning", "General Business", "Administrative"),
  "Engineering & Production" = c("Engineering", "Production", "Manufacturing", "Product & Project", "Product Management", "Project Management"),
  "Healthcare & Science" = c("Health Care Provider", "Science"),
  "Supply Chain & Logistics" = c("Supply Chain", "Purchasing", "Distribution"),
  "Finance & Accounting" = c("Finance", "Accounting/Auditing", "Financial Analyst"),
  "Human Resources & Training" = c("Human Resources", "Training", "Consulting"),
  "Legal & Compliance" = c("Legal", "Quality Assurance"),
  "Arts" = c("Art/Creative", "Writing/Editing", "Design")
)

# Apply the mapping
for (group in names(function_map)) {
  job$function_clean[job$function_clean %in% function_map[[group]]] <- group
}

# Optional: group any remaining very rare functions into "other"
function_counts <- table(job$function_clean)
job$function_clean <- ifelse(function_counts[job$function_clean] > 10,
                             job$function_clean,
                             "other")

# Convert to factor
job$function_clean <- factor(job$function_clean)

# Check resulting counts
table(job$function_clean)
## 
## Analytics & Business Development                             Arts 
##                              394                              604 
##                        Education         Engineering & Production 
##                              325                             1835 
##             Finance & Accounting             Healthcare & Science 
##                              417                              352 
##       Human Resources & Training               Legal & Compliance 
##                              387                              158 
##          Management & Leadership          Marketing & Advertising 
##                             1061                              996 
##                               NA                            Other 
##                             6455                              325 
##                         Research    Sales & Customer Service & IT 
##                               50                             4446 
##         Supply Chain & Logistics 
##                               75

Location Cleaning

# Grab the country 
job$loc_country <- substr(job$location, 1, 2)
job$loc_country <- as.factor(job$loc_country)
loc_count <- as.data.frame(table(job$loc_country))

sort(table(job$loc_country))
## 
##    AL    CM    CO    GH    HR    JM    KH    KZ    MA    PE    SD    SI    SV 
##     1     1     1     1     1     1     1     1     1     1     1     1     1 
##    UG    AM    BD    CL    IS    KW    LK    SK    TN    ZM    VI    NI    TT 
##     1     2     2     2     2     2     2     2     2     2     3     4     4 
##    TW    VN    CZ    LV    KE    RS    NO    AR    BH    BY    LU    PA    IQ 
##     4     4     6     6     7     7     8     9     9     9     9     9    10 
##    KR    NG    TH    CY    ID    MT    UA    AT    HU    MU    CH    CN    SA 
##    10    10    10    11    13    13    13    14    14    14    15    15    15 
##    BG    TR    MX    PT    JP    RU    MY    QA    LT    PK    FI    IT    BR 
##    17    17    18    18    20    20    21    21    23    27    29    31    36 
##    ZA    DK    RO    SE    EG    AE    ES    FR    EE    IL    PL    HK    SG 
##    40    42    46    49    52    54    66    70    72    72    76    77    80 
##    IE    BE    NL    PH    AU    IN    NZ          DE    CA    GR    GB    US 
##   114   117   127   132   214   276   333   346   383   457   940  2384 10656
# tabulate counts by country
tab <- table(job$loc_country)
# map counts back to rows
job$loc_country_val <- as.integer(tab[ as.character(job$loc_country) ])
# create new column: keep country if count > 10, else "Other"
job$loc_country_new <- ifelse(job$loc_country_val > 10, as.character(job$loc_country), "Other")
# Label missing or blank countries as "NA", then convert to a factor
job$loc_country_new[is.na(job$loc_country_new) | job$loc_country_new == ""] <- "NA"
job$loc_country_new <- factor(job$loc_country_new)
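
The lumping steps above can be wrapped in a small reusable function. This is only a sketch: the name lump_rare and its arguments are hypothetical and are not part of the original analysis. It follows the same rule as the code above (keep a value when its count is greater than 10, otherwise use "Other") and the same "NA" label for missing or blank values.

```r
# Hypothetical helper (not in the original analysis): lump rare values into "Other".
lump_rare <- function(x, min_count = 10, other = "Other", missing = "NA") {
  x <- as.character(x)
  counts <- table(x)
  # Per-row count of that row's own value. Blank strings and NA never match
  # a name when indexing, so their count comes back as NA.
  n <- as.integer(counts[x])
  out <- ifelse(n > min_count, x, other)
  # Rows with a missing or blank value get an explicit label.
  out[is.na(x) | x == ""] <- missing
  factor(out)
}

# Made-up toy input: two frequent codes, two rare codes, one blank, one NA.
codes <- c(rep("US", 12), rep("GB", 11), "AL", "CM", "", NA)
table(lump_rare(codes))
```

Writing the rule once keeps the threshold in a single place if other high-cardinality columns are lumped the same way.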

Check Data

# Drop the raw columns and the helper count column now that cleaned versions exist
job$loc_country_val <- NULL
job$loc_country <- NULL
job$department <- NULL
job$salary_range <- NULL
job$industry <- NULL
job$location <- NULL
job$function. <- NULL

str(job)
## 'data.frame':    17880 obs. of  52 variables:
##  $ title                         : int  16 41 39 33 19 16 21 32 10 39 ...
##  $ company_profile               : int  885 1286 879 614 1628 0 881 1025 1364 684 ...
##  $ description                   : int  905 2077 355 2600 1520 3418 433 2488 75 1219 ...
##  $ requirements                  : int  852 1433 1363 1429 757 0 764 368 359 769 ...
##  $ benefits                      : num  0 1292 0 782 21 ...
##  $ telecommuting                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_logo              : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_questions                 : int  0 0 0 0 1 0 1 1 1 0 ...
##  $ employment_type               : Factor w/ 6 levels "","Contract",..: 4 3 1 3 3 1 3 1 3 5 ...
##  $ required_experience           : Factor w/ 8 levels "","Associate",..: 6 8 1 7 7 1 7 1 2 4 ...
##  $ required_education            : Factor w/ 14 levels "","Associate Degree",..: 1 1 1 3 3 1 7 1 1 6 ...
##  $ fraudulent                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_pipe                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_hash                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_bonus                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_apply                : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ benefits_benefits             : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ slash_present                 : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ backslash_present             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amp_present                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclam_present                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dash_present                  : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ multiple_spaces               : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ parens_present                : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ numbers_present               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_missing_or_short          : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_heavy_engineering_terms   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_certification_terms       : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_years_experience          : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_degree_required           : int  1 0 1 1 1 0 1 0 0 1 ...
##  $ has_tool_software_terms       : int  1 1 0 1 0 0 0 0 0 1 ...
##  $ has_safety_regulation_terms   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_contains_heavy_lists      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_title_mismatch            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_urgent_language           : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ has_no_experience_needed      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_salary_info               : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_qualification_terms       : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_benefits_stated           : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_technical_terms           : int  0 1 0 1 1 1 0 1 0 0 ...
##  $ has_contact_number_or_whatsapp: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_language          : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_commission_only_language  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_referral_bonus            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_signing_bonus             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_perks                     : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_relocation                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ salary_known                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ department_clean              : Factor w/ 38 levels "account management",..: 22 27 24 34 24 24 27 24 24 24 ...
##  $ industry_clean                : Factor w/ 17 levels "Agriculture, Food & Natural Resources",..: 14 13 14 16 10 14 13 14 16 8 ...
##  $ function_clean                : Factor w/ 15 levels "Analytics & Business Development",..: 10 14 11 14 6 11 9 11 11 14 ...
##  $ loc_country_new               : Factor w/ 50 levels "AE","AT","AU",..: 49 35 49 49 49 49 11 49 49 49 ...
summary(job)
##      title        company_profile   description     requirements    
##  Min.   :  3.00   Min.   :   0.0   Min.   :    3   Min.   :    0.0  
##  1st Qu.: 19.00   1st Qu.: 138.0   1st Qu.:  607   1st Qu.:  146.0  
##  Median : 25.00   Median : 570.0   Median : 1017   Median :  467.0  
##  Mean   : 28.53   Mean   : 620.9   Mean   : 1218   Mean   :  590.1  
##  3rd Qu.: 35.00   3rd Qu.: 879.0   3rd Qu.: 1586   3rd Qu.:  820.0  
##  Max.   :142.00   Max.   :6178.0   Max.   :14907   Max.   :10864.0  
##                                                                     
##     benefits      telecommuting    has_company_logo has_questions   
##  Min.   :   0.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :  45.0   Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   : 208.9   Mean   :0.0429   Mean   :0.7953   Mean   :0.4917  
##  3rd Qu.: 294.0   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :4429.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##   employment_type        required_experience                 required_education
##           : 3471                   :7050                              :8105    
##  Contract : 1524   Mid-Senior level:3809     Bachelor's Degree        :5145    
##  Full-time:11620   Entry level     :2697     High School or equivalent:2080    
##  Other    :  227   Associate       :2297     Unspecified              :1397    
##  Part-time:  797   Not Applicable  :1116     Master's Degree          : 416    
##  Temporary:  241   Director        : 389     Associate Degree         : 274    
##                    (Other)         : 522     (Other)                  : 463    
##    fraudulent      benefits_pipe      benefits_hash     benefits_bonus   
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.04843   Mean   :0.001566   Mean   :0.05543   Mean   :0.07131  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
##                                                                          
##  benefits_apply    benefits_benefits slash_present     backslash_present  
##  Min.   :0.00000   Min.   :0.0000    Min.   :0.00000   Min.   :0.0000000  
##  1st Qu.:0.00000   1st Qu.:0.0000    1st Qu.:0.00000   1st Qu.:0.0000000  
##  Median :0.00000   Median :0.0000    Median :0.00000   Median :0.0000000  
##  Mean   :0.04234   Mean   :0.2012    Mean   :0.09659   Mean   :0.0001119  
##  3rd Qu.:0.00000   3rd Qu.:0.0000    3rd Qu.:0.00000   3rd Qu.:0.0000000  
##  Max.   :1.00000   Max.   :1.0000    Max.   :1.00000   Max.   :1.0000000  
##                                                                           
##   amp_present      exclam_present     dash_present   multiple_spaces   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.000   Median :0.000000  
##  Mean   :0.03356   Mean   :0.01102   Mean   :0.169   Mean   :0.009228  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000   Max.   :1.000000  
##                                                                        
##  parens_present    numbers_present   req_missing_or_short
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :0.00000   Median :0.00000   Median :0.0000      
##  Mean   :0.08853   Mean   :0.04787   Mean   :0.1763      
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000      
##                                                          
##  has_heavy_engineering_terms has_certification_terms has_years_experience
##  Min.   :0.00000             Min.   :0.0000          Min.   :0.0000      
##  1st Qu.:0.00000             1st Qu.:1.0000          1st Qu.:0.0000      
##  Median :0.00000             Median :1.0000          Median :0.0000      
##  Mean   :0.06549             Mean   :0.7724          Mean   :0.3444      
##  3rd Qu.:0.00000             3rd Qu.:1.0000          3rd Qu.:1.0000      
##  Max.   :1.00000             Max.   :1.0000          Max.   :1.0000      
##                                                                          
##  has_degree_required has_tool_software_terms has_safety_regulation_terms
##  Min.   :0.0000      Min.   :0.0000          Min.   :0.00000            
##  1st Qu.:0.0000      1st Qu.:0.0000          1st Qu.:0.00000            
##  Median :0.0000      Median :0.0000          Median :0.00000            
##  Mean   :0.4238      Mean   :0.3497          Mean   :0.03216            
##  3rd Qu.:1.0000      3rd Qu.:1.0000          3rd Qu.:0.00000            
##  Max.   :1.0000      Max.   :1.0000          Max.   :1.00000            
##                                                                         
##  req_contains_heavy_lists req_title_mismatch has_urgent_language
##  Min.   :0.00000          Min.   :0.00000    Min.   :0.00000    
##  1st Qu.:0.00000          1st Qu.:0.00000    1st Qu.:0.00000    
##  Median :0.00000          Median :0.00000    Median :0.00000    
##  Mean   :0.02136          Mean   :0.06549    Mean   :0.07959    
##  3rd Qu.:0.00000          3rd Qu.:0.00000    3rd Qu.:0.00000    
##  Max.   :1.00000          Max.   :1.00000    Max.   :1.00000    
##                                                                 
##  has_no_experience_needed has_salary_info  has_qualification_terms
##  Min.   :0.000000         Min.   :0.0000   Min.   :0.0000         
##  1st Qu.:0.000000         1st Qu.:1.0000   1st Qu.:0.0000         
##  Median :0.000000         Median :1.0000   Median :0.0000         
##  Mean   :0.007159         Mean   :0.9705   Mean   :0.1006         
##  3rd Qu.:0.000000         3rd Qu.:1.0000   3rd Qu.:0.0000         
##  Max.   :1.000000         Max.   :1.0000   Max.   :1.0000         
##                                                                   
##  has_benefits_stated has_technical_terms has_contact_number_or_whatsapp
##  Min.   :0.00000     Min.   :0.0000      Min.   :0.00000               
##  1st Qu.:0.00000     1st Qu.:0.0000      1st Qu.:0.00000               
##  Median :0.00000     Median :0.0000      Median :0.00000               
##  Mean   :0.09077     Mean   :0.2698      Mean   :0.00179               
##  3rd Qu.:0.00000     3rd Qu.:1.0000      3rd Qu.:0.00000               
##  Max.   :1.00000     Max.   :1.0000      Max.   :1.00000               
##                                                                        
##  has_company_language has_commission_only_language has_referral_bonus
##  Min.   :0.0000       Min.   :0.00000              Min.   :0.000000  
##  1st Qu.:0.0000       1st Qu.:0.00000              1st Qu.:0.000000  
##  Median :1.0000       Median :0.00000              Median :0.000000  
##  Mean   :0.6383       Mean   :0.00453              Mean   :0.006432  
##  3rd Qu.:1.0000       3rd Qu.:0.00000              3rd Qu.:0.000000  
##  Max.   :1.0000       Max.   :1.00000              Max.   :1.000000  
##                                                                      
##  has_signing_bonus    has_perks       has_relocation     salary_known
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   0:15012     
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1: 2868     
##  Median :0.000000   Median :0.00000   Median :0.000000               
##  Mean   :0.003132   Mean   :0.05872   Mean   :0.009955               
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000               
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000               
##                                                                      
##          department_clean                          industry_clean
##  NA              :11553   NA                              :4903  
##  other           : 2252   Technology & Software           :4436  
##  sales           :  594   Media, Entertainment & Creative :1486  
##  engineering     :  512   Finance, Banking & Insurance    :1189  
##  marketing       :  443   Education & Training            :1034  
##  tech_development:  377   Consumer Goods, Retail & Fashion: 889  
##  (Other)         : 2149   (Other)                         :3943  
##                        function_clean loc_country_new
##  NA                           :6455   US     :10656  
##  Sales & Customer Service & IT:4446   GB     : 2384  
##  Engineering & Production     :1835   GR     :  940  
##  Management & Leadership      :1061   CA     :  457  
##  Marketing & Advertising      : 996   DE     :  383  
##  Arts                         : 604   NA     :  346  
##  (Other)                      :2483   (Other): 2714

Dummify and Scale Data for KNN and ANN Models

# We can use job as-is for the logistic regression, SVM, and random forest models

# For the Decision Tree model, we need to dummify the data
job_dummy <- as.data.frame(model.matrix(~ . -1, data = job)) 

minmax <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# For the KNN and ANN models, we need to dummify and scale the data
job_scaled <- as.data.frame(lapply(job_dummy, minmax))
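
A toy example may help show what the two steps above do. The data frame toy and the function minmax_safe are made up for illustration and are not part of the original analysis. With the intercept removed (~ . - 1), model.matrix keeps an indicator column for every level of the first factor but drops the reference level of each later factor. The guarded scaler is an assumption about how one might handle a constant column, where plain min-max scaling would divide by zero.

```r
# Toy data (made up for illustration; not the job postings data).
toy <- data.frame(
  len  = c(10, 20, 30, 40),
  type = factor(c("a", "b", "c", "a")),
  lvl  = factor(c("x", "y", "x", "y")),
  flat = c(1, 1, 1, 1)
)

# Same call as above: the first factor (type) gets one column per level,
# the second factor (lvl) loses its reference level "x".
toy_dummy <- as.data.frame(model.matrix(~ . - 1, data = toy))
names(toy_dummy)

# Hypothetical guarded variant of minmax: a constant column maps to 0
# instead of producing NaN from 0 / 0.
minmax_safe <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

toy_scaled <- as.data.frame(lapply(toy_dummy, minmax_safe))
toy_scaled
```

The guard only matters if some dummy column ends up constant, for example after subsetting rows. When every column varies, minmax_safe returns the same values as minmax.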

Final Data Check

# Data for Logistic Regression, SVM, and Random Forest models
str(job)
## 'data.frame':    17880 obs. of  52 variables:
##  $ title                         : int  16 41 39 33 19 16 21 32 10 39 ...
##  $ company_profile               : int  885 1286 879 614 1628 0 881 1025 1364 684 ...
##  $ description                   : int  905 2077 355 2600 1520 3418 433 2488 75 1219 ...
##  $ requirements                  : int  852 1433 1363 1429 757 0 764 368 359 769 ...
##  $ benefits                      : num  0 1292 0 782 21 ...
##  $ telecommuting                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_logo              : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_questions                 : int  0 0 0 0 1 0 1 1 1 0 ...
##  $ employment_type               : Factor w/ 6 levels "","Contract",..: 4 3 1 3 3 1 3 1 3 5 ...
##  $ required_experience           : Factor w/ 8 levels "","Associate",..: 6 8 1 7 7 1 7 1 2 4 ...
##  $ required_education            : Factor w/ 14 levels "","Associate Degree",..: 1 1 1 3 3 1 7 1 1 6 ...
##  $ fraudulent                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_pipe                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_hash                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_bonus                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_apply                : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ benefits_benefits             : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ slash_present                 : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ backslash_present             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amp_present                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclam_present                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dash_present                  : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ multiple_spaces               : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ parens_present                : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ numbers_present               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_missing_or_short          : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_heavy_engineering_terms   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_certification_terms       : int  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_years_experience          : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_degree_required           : int  1 0 1 1 1 0 1 0 0 1 ...
##  $ has_tool_software_terms       : int  1 1 0 1 0 0 0 0 0 1 ...
##  $ has_safety_regulation_terms   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_contains_heavy_lists      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_title_mismatch            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_urgent_language           : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ has_no_experience_needed      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_salary_info               : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_qualification_terms       : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_benefits_stated           : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_technical_terms           : int  0 1 0 1 1 1 0 1 0 0 ...
##  $ has_contact_number_or_whatsapp: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_language          : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_commission_only_language  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_referral_bonus            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_signing_bonus             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_perks                     : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_relocation                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ salary_known                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ department_clean              : Factor w/ 38 levels "account management",..: 22 27 24 34 24 24 27 24 24 24 ...
##  $ industry_clean                : Factor w/ 17 levels "Agriculture, Food & Natural Resources",..: 14 13 14 16 10 14 13 14 16 8 ...
##  $ function_clean                : Factor w/ 15 levels "Analytics & Business Development",..: 10 14 11 14 6 11 9 11 11 14 ...
##  $ loc_country_new               : Factor w/ 50 levels "AE","AT","AU",..: 49 35 49 49 49 49 11 49 49 49 ...
summary(job)
##      title        company_profile   description     requirements    
##  Min.   :  3.00   Min.   :   0.0   Min.   :    3   Min.   :    0.0  
##  1st Qu.: 19.00   1st Qu.: 138.0   1st Qu.:  607   1st Qu.:  146.0  
##  Median : 25.00   Median : 570.0   Median : 1017   Median :  467.0  
##  Mean   : 28.53   Mean   : 620.9   Mean   : 1218   Mean   :  590.1  
##  3rd Qu.: 35.00   3rd Qu.: 879.0   3rd Qu.: 1586   3rd Qu.:  820.0  
##  Max.   :142.00   Max.   :6178.0   Max.   :14907   Max.   :10864.0  
##                                                                     
##     benefits      telecommuting    has_company_logo has_questions   
##  Min.   :   0.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :  45.0   Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   : 208.9   Mean   :0.0429   Mean   :0.7953   Mean   :0.4917  
##  3rd Qu.: 294.0   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :4429.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##   employment_type        required_experience                 required_education
##           : 3471                   :7050                              :8105    
##  Contract : 1524   Mid-Senior level:3809     Bachelor's Degree        :5145    
##  Full-time:11620   Entry level     :2697     High School or equivalent:2080    
##  Other    :  227   Associate       :2297     Unspecified              :1397    
##  Part-time:  797   Not Applicable  :1116     Master's Degree          : 416    
##  Temporary:  241   Director        : 389     Associate Degree         : 274    
##                    (Other)         : 522     (Other)                  : 463    
##    fraudulent      benefits_pipe      benefits_hash     benefits_bonus   
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.04843   Mean   :0.001566   Mean   :0.05543   Mean   :0.07131  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
##                                                                          
##  benefits_apply    benefits_benefits slash_present     backslash_present  
##  Min.   :0.00000   Min.   :0.0000    Min.   :0.00000   Min.   :0.0000000  
##  1st Qu.:0.00000   1st Qu.:0.0000    1st Qu.:0.00000   1st Qu.:0.0000000  
##  Median :0.00000   Median :0.0000    Median :0.00000   Median :0.0000000  
##  Mean   :0.04234   Mean   :0.2012    Mean   :0.09659   Mean   :0.0001119  
##  3rd Qu.:0.00000   3rd Qu.:0.0000    3rd Qu.:0.00000   3rd Qu.:0.0000000  
##  Max.   :1.00000   Max.   :1.0000    Max.   :1.00000   Max.   :1.0000000  
##                                                                           
##   amp_present      exclam_present     dash_present   multiple_spaces   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.000   Median :0.000000  
##  Mean   :0.03356   Mean   :0.01102   Mean   :0.169   Mean   :0.009228  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000   Max.   :1.000000  
##                                                                        
##  parens_present    numbers_present   req_missing_or_short
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :0.00000   Median :0.00000   Median :0.0000      
##  Mean   :0.08853   Mean   :0.04787   Mean   :0.1763      
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000      
##                                                          
##  has_heavy_engineering_terms has_certification_terms has_years_experience
##  Min.   :0.00000             Min.   :0.0000          Min.   :0.0000      
##  1st Qu.:0.00000             1st Qu.:1.0000          1st Qu.:0.0000      
##  Median :0.00000             Median :1.0000          Median :0.0000      
##  Mean   :0.06549             Mean   :0.7724          Mean   :0.3444      
##  3rd Qu.:0.00000             3rd Qu.:1.0000          3rd Qu.:1.0000      
##  Max.   :1.00000             Max.   :1.0000          Max.   :1.0000      
##                                                                          
##  has_degree_required has_tool_software_terms has_safety_regulation_terms
##  Min.   :0.0000      Min.   :0.0000          Min.   :0.00000            
##  1st Qu.:0.0000      1st Qu.:0.0000          1st Qu.:0.00000            
##  Median :0.0000      Median :0.0000          Median :0.00000            
##  Mean   :0.4238      Mean   :0.3497          Mean   :0.03216            
##  3rd Qu.:1.0000      3rd Qu.:1.0000          3rd Qu.:0.00000            
##  Max.   :1.0000      Max.   :1.0000          Max.   :1.00000            
##                                                                         
##  req_contains_heavy_lists req_title_mismatch has_urgent_language
##  Min.   :0.00000          Min.   :0.00000    Min.   :0.00000    
##  1st Qu.:0.00000          1st Qu.:0.00000    1st Qu.:0.00000    
##  Median :0.00000          Median :0.00000    Median :0.00000    
##  Mean   :0.02136          Mean   :0.06549    Mean   :0.07959    
##  3rd Qu.:0.00000          3rd Qu.:0.00000    3rd Qu.:0.00000    
##  Max.   :1.00000          Max.   :1.00000    Max.   :1.00000    
##                                                                 
##  has_no_experience_needed has_salary_info  has_qualification_terms
##  Min.   :0.000000         Min.   :0.0000   Min.   :0.0000         
##  1st Qu.:0.000000         1st Qu.:1.0000   1st Qu.:0.0000         
##  Median :0.000000         Median :1.0000   Median :0.0000         
##  Mean   :0.007159         Mean   :0.9705   Mean   :0.1006         
##  3rd Qu.:0.000000         3rd Qu.:1.0000   3rd Qu.:0.0000         
##  Max.   :1.000000         Max.   :1.0000   Max.   :1.0000         
##                                                                   
##  has_benefits_stated has_technical_terms has_contact_number_or_whatsapp
##  Min.   :0.00000     Min.   :0.0000      Min.   :0.00000               
##  1st Qu.:0.00000     1st Qu.:0.0000      1st Qu.:0.00000               
##  Median :0.00000     Median :0.0000      Median :0.00000               
##  Mean   :0.09077     Mean   :0.2698      Mean   :0.00179               
##  3rd Qu.:0.00000     3rd Qu.:1.0000      3rd Qu.:0.00000               
##  Max.   :1.00000     Max.   :1.0000      Max.   :1.00000               
##                                                                        
##  has_company_language has_commission_only_language has_referral_bonus
##  Min.   :0.0000       Min.   :0.00000              Min.   :0.000000  
##  1st Qu.:0.0000       1st Qu.:0.00000              1st Qu.:0.000000  
##  Median :1.0000       Median :0.00000              Median :0.000000  
##  Mean   :0.6383       Mean   :0.00453              Mean   :0.006432  
##  3rd Qu.:1.0000       3rd Qu.:0.00000              3rd Qu.:0.000000  
##  Max.   :1.0000       Max.   :1.00000              Max.   :1.000000  
##                                                                      
##  has_signing_bonus    has_perks       has_relocation     salary_known
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   0:15012     
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1: 2868     
##  Median :0.000000   Median :0.00000   Median :0.000000               
##  Mean   :0.003132   Mean   :0.05872   Mean   :0.009955               
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000               
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000               
##                                                                      
##          department_clean                          industry_clean
##  NA              :11553   NA                              :4903  
##  other           : 2252   Technology & Software           :4436  
##  sales           :  594   Media, Entertainment & Creative :1486  
##  engineering     :  512   Finance, Banking & Insurance    :1189  
##  marketing       :  443   Education & Training            :1034  
##  tech_development:  377   Consumer Goods, Retail & Fashion: 889  
##  (Other)         : 2149   (Other)                         :3943  
##                        function_clean loc_country_new
##  NA                           :6455   US     :10656  
##  Sales & Customer Service & IT:4446   GB     : 2384  
##  Engineering & Production     :1835   GR     :  940  
##  Management & Leadership      :1061   CA     :  457  
##  Marketing & Advertising      : 996   DE     :  383  
##  Arts                         : 604   NA     :  346  
##  (Other)                      :2483   (Other): 2714
# Data for Decision Tree Model
str(job_dummy)
## 'data.frame':    17880 obs. of  187 variables:
##  $ title                                                  : num  16 41 39 33 19 16 21 32 10 39 ...
##  $ company_profile                                        : num  885 1286 879 614 1628 ...
##  $ description                                            : num  905 2077 355 2600 1520 ...
##  $ requirements                                           : num  852 1433 1363 1429 757 ...
##  $ benefits                                               : num  0 1292 0 782 21 ...
##  $ telecommuting                                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_logo                                       : num  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_questions                                          : num  0 0 0 0 1 0 1 1 1 0 ...
##  $ employment_type                                        : num  0 0 1 0 0 1 0 1 0 0 ...
##  $ employment_typeContract                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ employment_typeFull-time                               : num  0 1 0 1 1 0 1 0 1 0 ...
##  $ employment_typeOther                                   : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ employment_typePart-time                               : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ employment_typeTemporary                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceAssociate                           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ required_experienceDirector                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceEntry level                         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ required_experienceExecutive                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceInternship                          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceMid-Senior level                    : num  0 0 0 1 1 0 1 0 0 0 ...
##  $ required_experienceNot Applicable                      : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ required_educationAssociate Degree                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationBachelor's Degree                    : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ required_educationCertification                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationDoctorate                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationHigh School or equivalent            : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ required_educationMaster's Degree                      : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ required_educationProfessional                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationSome College Coursework Completed    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationSome High School Coursework          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationUnspecified                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationVocational                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationVocational - Degree                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationVocational - HS Diploma              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fraudulent                                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_pipe                                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_hash                                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_bonus                                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_apply                                         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ benefits_benefits                                      : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ slash_present                                          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ backslash_present                                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amp_present                                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclam_present                                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dash_present                                           : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ multiple_spaces                                        : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ parens_present                                         : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ numbers_present                                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_missing_or_short                                   : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_heavy_engineering_terms                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_certification_terms                                : num  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_years_experience                                   : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_degree_required                                    : num  1 0 1 1 1 0 1 0 0 1 ...
##  $ has_tool_software_terms                                : num  1 1 0 1 0 0 0 0 0 1 ...
##  $ has_safety_regulation_terms                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_contains_heavy_lists                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_title_mismatch                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_urgent_language                                    : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ has_no_experience_needed                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_salary_info                                        : num  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_qualification_terms                                : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_benefits_stated                                    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_technical_terms                                    : num  0 1 0 1 1 1 0 1 0 0 ...
##  $ has_contact_number_or_whatsapp                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_language                                   : num  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_commission_only_language                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_referral_bonus                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_signing_bonus                                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_perks                                              : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_relocation                                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ salary_known1                                          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ department_cleanaccounting                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanadministration                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanall                                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanart studio                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanbusiness_management                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanclerical                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancommercial                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancreative                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancustomer service                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancustomer_facing                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleandepartment                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleandigital                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleaneducation_training                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanengagement                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanengineering                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanfinance                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanhr                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleaninternational growth                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanit                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanlegal                                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanmarketing                              : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanmerchandising                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanNA                                     : num  0 0 1 0 1 1 0 1 1 1 ...
##  $ department_cleanoperations                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanoperations_logistics                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanother                                  : num  0 1 0 0 0 0 1 0 0 0 ...
##  $ department_cleanpermanent                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanproduct                                : num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
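Several of the dummy columns listed above have names containing spaces, apostrophes, or hyphens (for example `required_educationMaster's Degree` and `department_cleanart studio`). Names like these must be wrapped in backticks and can trip up some formula-based modeling functions. The following is a minimal sketch of one way to sanitize them with base R's `make.names()`. It is not necessarily what this analysis does, and the `toy` data frame is a hypothetical stand-in for `job_dummy`.

```r
# Hypothetical stand-in for job_dummy, using three column names from the str() output
toy <- data.frame(c(0, 1), c(1, 0), c(0, 0))
names(toy) <- c("required_educationMaster's Degree",
                "department_cleanart studio",
                "fraudulent")

# make.names() replaces characters that are invalid in R names with "."
# and, with unique = TRUE, de-duplicates any collisions that result
toy_clean <- toy
names(toy_clean) <- make.names(names(toy), unique = TRUE)

# Show the old and new names side by side
print(data.frame(original = names(toy), cleaned = names(toy_clean)))
```

If this approach were applied to the full data frame, the same renaming would have to be applied to the training and test sets alike so that their column names still match.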
summary(job_dummy)
##      title        company_profile   description     requirements    
##  Min.   :  3.00   Min.   :   0.0   Min.   :    3   Min.   :    0.0  
##  1st Qu.: 19.00   1st Qu.: 138.0   1st Qu.:  607   1st Qu.:  146.0  
##  Median : 25.00   Median : 570.0   Median : 1017   Median :  467.0  
##  Mean   : 28.53   Mean   : 620.9   Mean   : 1218   Mean   :  590.1  
##  3rd Qu.: 35.00   3rd Qu.: 879.0   3rd Qu.: 1586   3rd Qu.:  820.0  
##  Max.   :142.00   Max.   :6178.0   Max.   :14907   Max.   :10864.0  
##     benefits      telecommuting    has_company_logo has_questions   
##  Min.   :   0.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :  45.0   Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   : 208.9   Mean   :0.0429   Mean   :0.7953   Mean   :0.4917  
##  3rd Qu.: 294.0   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :4429.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  employment_type  employment_typeContract employment_typeFull-time
##  Min.   :0.0000   Min.   :0.00000         Min.   :0.0000          
##  1st Qu.:0.0000   1st Qu.:0.00000         1st Qu.:0.0000          
##  Median :0.0000   Median :0.00000         Median :1.0000          
##  Mean   :0.1941   Mean   :0.08523         Mean   :0.6499          
##  3rd Qu.:0.0000   3rd Qu.:0.00000         3rd Qu.:1.0000          
##  Max.   :1.0000   Max.   :1.00000         Max.   :1.0000          
##  employment_typeOther employment_typePart-time employment_typeTemporary
##  Min.   :0.0000       Min.   :0.00000          Min.   :0.00000         
##  1st Qu.:0.0000       1st Qu.:0.00000          1st Qu.:0.00000         
##  Median :0.0000       Median :0.00000          Median :0.00000         
##  Mean   :0.0127       Mean   :0.04457          Mean   :0.01348         
##  3rd Qu.:0.0000       3rd Qu.:0.00000          3rd Qu.:0.00000         
##  Max.   :1.0000       Max.   :1.00000          Max.   :1.00000         
##  required_experienceAssociate required_experienceDirector
##  Min.   :0.0000               Min.   :0.00000            
##  1st Qu.:0.0000               1st Qu.:0.00000            
##  Median :0.0000               Median :0.00000            
##  Mean   :0.1285               Mean   :0.02176            
##  3rd Qu.:0.0000               3rd Qu.:0.00000            
##  Max.   :1.0000               Max.   :1.00000            
##  required_experienceEntry level required_experienceExecutive
##  Min.   :0.0000                 Min.   :0.000000            
##  1st Qu.:0.0000                 1st Qu.:0.000000            
##  Median :0.0000                 Median :0.000000            
##  Mean   :0.1508                 Mean   :0.007886            
##  3rd Qu.:0.0000                 3rd Qu.:0.000000            
##  Max.   :1.0000                 Max.   :1.000000            
##  required_experienceInternship required_experienceMid-Senior level
##  Min.   :0.00000               Min.   :0.000                      
##  1st Qu.:0.00000               1st Qu.:0.000                      
##  Median :0.00000               Median :0.000                      
##  Mean   :0.02131               Mean   :0.213                      
##  3rd Qu.:0.00000               3rd Qu.:0.000                      
##  Max.   :1.00000               Max.   :1.000                      
##  required_experienceNot Applicable required_educationAssociate Degree
##  Min.   :0.00000                   Min.   :0.00000                   
##  1st Qu.:0.00000                   1st Qu.:0.00000                   
##  Median :0.00000                   Median :0.00000                   
##  Mean   :0.06242                   Mean   :0.01532                   
##  3rd Qu.:0.00000                   3rd Qu.:0.00000                   
##  Max.   :1.00000                   Max.   :1.00000                   
##  required_educationBachelor's Degree required_educationCertification
##  Min.   :0.0000                      Min.   :0.000000               
##  1st Qu.:0.0000                      1st Qu.:0.000000               
##  Median :0.0000                      Median :0.000000               
##  Mean   :0.2878                      Mean   :0.009508               
##  3rd Qu.:1.0000                      3rd Qu.:0.000000               
##  Max.   :1.0000                      Max.   :1.000000               
##  required_educationDoctorate required_educationHigh School or equivalent
##  Min.   :0.000000            Min.   :0.0000                             
##  1st Qu.:0.000000            1st Qu.:0.0000                             
##  Median :0.000000            Median :0.0000                             
##  Mean   :0.001454            Mean   :0.1163                             
##  3rd Qu.:0.000000            3rd Qu.:0.0000                             
##  Max.   :1.000000            Max.   :1.0000                             
##  required_educationMaster's Degree required_educationProfessional
##  Min.   :0.00000                   Min.   :0.000000              
##  1st Qu.:0.00000                   1st Qu.:0.000000              
##  Median :0.00000                   Median :0.000000              
##  Mean   :0.02327                   Mean   :0.004139              
##  3rd Qu.:0.00000                   3rd Qu.:0.000000              
##  Max.   :1.00000                   Max.   :1.000000              
##  required_educationSome College Coursework Completed
##  Min.   :0.000000                                   
##  1st Qu.:0.000000                                   
##  Median :0.000000                                   
##  Mean   :0.005705                                   
##  3rd Qu.:0.000000                                   
##  Max.   :1.000000                                   
##  required_educationSome High School Coursework required_educationUnspecified
##  Min.   :0.00000                               Min.   :0.00000              
##  1st Qu.:0.00000                               1st Qu.:0.00000              
##  Median :0.00000                               Median :0.00000              
##  Mean   :0.00151                               Mean   :0.07813              
##  3rd Qu.:0.00000                               3rd Qu.:0.00000              
##  Max.   :1.00000                               Max.   :1.00000              
##  required_educationVocational required_educationVocational - Degree
##  Min.   :0.00000              Min.   :0.0000000                    
##  1st Qu.:0.00000              1st Qu.:0.0000000                    
##  Median :0.00000              Median :0.0000000                    
##  Mean   :0.00274              Mean   :0.0003356                    
##  3rd Qu.:0.00000              3rd Qu.:0.0000000                    
##  Max.   :1.00000              Max.   :1.0000000                    
##  required_educationVocational - HS Diploma   fraudulent      benefits_pipe     
##  Min.   :0.0000000                         Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.0000000                         1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.0000000                         Median :0.00000   Median :0.000000  
##  Mean   :0.0005034                         Mean   :0.04843   Mean   :0.001566  
##  3rd Qu.:0.0000000                         3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.0000000                         Max.   :1.00000   Max.   :1.000000  
##  benefits_hash     benefits_bonus    benefits_apply    benefits_benefits
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.0000   
##  Mean   :0.05543   Mean   :0.07131   Mean   :0.04234   Mean   :0.2012   
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   
##  slash_present     backslash_present    amp_present      exclam_present   
##  Min.   :0.00000   Min.   :0.0000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000000   Median :0.00000   Median :0.00000  
##  Mean   :0.09659   Mean   :0.0001119   Mean   :0.03356   Mean   :0.01102  
##  3rd Qu.:0.00000   3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000000   Max.   :1.00000   Max.   :1.00000  
##   dash_present   multiple_spaces    parens_present    numbers_present  
##  Min.   :0.000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.169   Mean   :0.009228   Mean   :0.08853   Mean   :0.04787  
##  3rd Qu.:0.000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
##  req_missing_or_short has_heavy_engineering_terms has_certification_terms
##  Min.   :0.0000       Min.   :0.00000             Min.   :0.0000         
##  1st Qu.:0.0000       1st Qu.:0.00000             1st Qu.:1.0000         
##  Median :0.0000       Median :0.00000             Median :1.0000         
##  Mean   :0.1763       Mean   :0.06549             Mean   :0.7724         
##  3rd Qu.:0.0000       3rd Qu.:0.00000             3rd Qu.:1.0000         
##  Max.   :1.0000       Max.   :1.00000             Max.   :1.0000         
##  has_years_experience has_degree_required has_tool_software_terms
##  Min.   :0.0000       Min.   :0.0000      Min.   :0.0000         
##  1st Qu.:0.0000       1st Qu.:0.0000      1st Qu.:0.0000         
##  Median :0.0000       Median :0.0000      Median :0.0000         
##  Mean   :0.3444       Mean   :0.4238      Mean   :0.3497         
##  3rd Qu.:1.0000       3rd Qu.:1.0000      3rd Qu.:1.0000         
##  Max.   :1.0000       Max.   :1.0000      Max.   :1.0000         
##  has_safety_regulation_terms req_contains_heavy_lists req_title_mismatch
##  Min.   :0.00000             Min.   :0.00000          Min.   :0.00000   
##  1st Qu.:0.00000             1st Qu.:0.00000          1st Qu.:0.00000   
##  Median :0.00000             Median :0.00000          Median :0.00000   
##  Mean   :0.03216             Mean   :0.02136          Mean   :0.06549   
##  3rd Qu.:0.00000             3rd Qu.:0.00000          3rd Qu.:0.00000   
##  Max.   :1.00000             Max.   :1.00000          Max.   :1.00000   
##  has_urgent_language has_no_experience_needed has_salary_info 
##  Min.   :0.00000     Min.   :0.000000         Min.   :0.0000  
##  1st Qu.:0.00000     1st Qu.:0.000000         1st Qu.:1.0000  
##  Median :0.00000     Median :0.000000         Median :1.0000  
##  Mean   :0.07959     Mean   :0.007159         Mean   :0.9705  
##  3rd Qu.:0.00000     3rd Qu.:0.000000         3rd Qu.:1.0000  
##  Max.   :1.00000     Max.   :1.000000         Max.   :1.0000  
##  has_qualification_terms has_benefits_stated has_technical_terms
##  Min.   :0.0000          Min.   :0.00000     Min.   :0.0000     
##  1st Qu.:0.0000          1st Qu.:0.00000     1st Qu.:0.0000     
##  Median :0.0000          Median :0.00000     Median :0.0000     
##  Mean   :0.1006          Mean   :0.09077     Mean   :0.2698     
##  3rd Qu.:0.0000          3rd Qu.:0.00000     3rd Qu.:1.0000     
##  Max.   :1.0000          Max.   :1.00000     Max.   :1.0000     
##  has_contact_number_or_whatsapp has_company_language
##  Min.   :0.00000                Min.   :0.0000      
##  1st Qu.:0.00000                1st Qu.:0.0000      
##  Median :0.00000                Median :1.0000      
##  Mean   :0.00179                Mean   :0.6383      
##  3rd Qu.:0.00000                3rd Qu.:1.0000      
##  Max.   :1.00000                Max.   :1.0000      
##  has_commission_only_language has_referral_bonus has_signing_bonus 
##  Min.   :0.00000              Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.00000              1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.00000              Median :0.000000   Median :0.000000  
##  Mean   :0.00453              Mean   :0.006432   Mean   :0.003132  
##  3rd Qu.:0.00000              3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.00000              Max.   :1.000000   Max.   :1.000000  
##    has_perks       has_relocation     salary_known1   
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   :0.05872   Mean   :0.009955   Mean   :0.1604  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##  department_cleanaccounting department_cleanadministration department_cleanall
##  Min.   :0.000000           Min.   :0.000000               Min.   :0.0000000  
##  1st Qu.:0.000000           1st Qu.:0.000000               1st Qu.:0.0000000  
##  Median :0.000000           Median :0.000000               Median :0.0000000  
##  Mean   :0.002685           Mean   :0.005313               Mean   :0.0008948  
##  3rd Qu.:0.000000           3rd Qu.:0.000000               3rd Qu.:0.0000000  
##  Max.   :1.000000           Max.   :1.000000               Max.   :1.0000000  
##  department_cleanart studio department_cleanbusiness_management
##  Min.   :0.0000000          Min.   :0.000000                   
##  1st Qu.:0.0000000          1st Qu.:0.000000                   
##  Median :0.0000000          Median :0.000000                   
##  Mean   :0.0006152          Mean   :0.004586                   
##  3rd Qu.:0.0000000          3rd Qu.:0.000000                   
##  Max.   :1.0000000          Max.   :1.000000                   
##  department_cleanclerical department_cleancommercial department_cleancreative
##  Min.   :0.00000          Min.   :0.000000           Min.   :0.000000        
##  1st Qu.:0.00000          1st Qu.:0.000000           1st Qu.:0.000000        
##  Median :0.00000          Median :0.000000           Median :0.000000        
##  Mean   :0.00151          Mean   :0.001007           Mean   :0.002685        
##  3rd Qu.:0.00000          3rd Qu.:0.000000           3rd Qu.:0.000000        
##  Max.   :1.00000          Max.   :1.000000           Max.   :1.000000        
##  department_cleancustomer service department_cleancustomer_facing
##  Min.   :0.00000                  Min.   :0.000000               
##  1st Qu.:0.00000                  1st Qu.:0.000000               
##  Median :0.00000                  Median :0.000000               
##  Mean   :0.00755                  Mean   :0.006655               
##  3rd Qu.:0.00000                  3rd Qu.:0.000000               
##  Max.   :1.00000                  Max.   :1.000000               
##  department_cleandepartment department_cleandigital
##  Min.   :0.000000           Min.   :0.000000       
##  1st Qu.:0.000000           1st Qu.:0.000000       
##  Median :0.000000           Median :0.000000       
##  Mean   :0.001286           Mean   :0.000783       
##  3rd Qu.:0.000000           3rd Qu.:0.000000       
##  Max.   :1.000000           Max.   :1.000000       
##  department_cleaneducation_training department_cleanengagement
##  Min.   :0.000000                   Min.   :0.0000000         
##  1st Qu.:0.000000                   1st Qu.:0.0000000         
##  Median :0.000000                   Median :0.0000000         
##  Mean   :0.003244                   Mean   :0.0007271         
##  3rd Qu.:0.000000                   3rd Qu.:0.0000000         
##  Max.   :1.000000                   Max.   :1.0000000         
##  department_cleanengineering department_cleanfinance department_cleanhr
##  Min.   :0.00000             Min.   :0.000000        Min.   :0.00000   
##  1st Qu.:0.00000             1st Qu.:0.000000        1st Qu.:0.00000   
##  Median :0.00000             Median :0.000000        Median :0.00000   
##  Mean   :0.02864             Mean   :0.004139        Mean   :0.00481   
##  3rd Qu.:0.00000             3rd Qu.:0.000000        3rd Qu.:0.00000   
##  Max.   :1.00000             Max.   :1.000000        Max.   :1.00000   
##  department_cleaninternational growth department_cleanit department_cleanlegal
##  Min.   :0.0000000                    Min.   :0.00000    Min.   :0.000000     
##  1st Qu.:0.0000000                    1st Qu.:0.00000    1st Qu.:0.000000     
##  Median :0.0000000                    Median :0.00000    Median :0.000000     
##  Mean   :0.0009508                    Mean   :0.01985    Mean   :0.001342     
##  3rd Qu.:0.0000000                    3rd Qu.:0.00000    3rd Qu.:0.000000     
##  Max.   :1.0000000                    Max.   :1.00000    Max.   :1.000000     
##  department_cleanmarketing department_cleanmerchandising department_cleanNA
##  Min.   :0.00000           Min.   :0.0000000             Min.   :0.0000    
##  1st Qu.:0.00000           1st Qu.:0.0000000             1st Qu.:0.0000    
##  Median :0.00000           Median :0.0000000             Median :1.0000    
##  Mean   :0.02478           Mean   :0.0006152             Mean   :0.6461    
##  3rd Qu.:0.00000           3rd Qu.:0.0000000             3rd Qu.:1.0000    
##  Max.   :1.00000           Max.   :1.0000000             Max.   :1.0000    
##  department_cleanoperations department_cleanoperations_logistics
##  Min.   :0.00000            Min.   :0.000000                    
##  1st Qu.:0.00000            1st Qu.:0.000000                    
##  Median :0.00000            Median :0.000000                    
##  Mean   :0.02047            Mean   :0.001566                    
##  3rd Qu.:0.00000            3rd Qu.:0.000000                    
##  Max.   :1.00000            Max.   :1.000000                    
##  department_cleanother department_cleanpermanent department_cleanproduct
##  Min.   :0.000         Min.   :0.0000000         Min.   :0.00000        
##  1st Qu.:0.000         1st Qu.:0.0000000         1st Qu.:0.00000        
##  Median :0.000         Median :0.0000000         Median :0.00000        
##  Mean   :0.126         Mean   :0.0007271         Mean   :0.01035        
##  3rd Qu.:0.000         3rd Qu.:0.0000000         3rd Qu.:0.00000        
##  Max.   :1.000         Max.   :1.0000000         Max.   :1.00000        
##  department_cleanproduction department_cleanqa department_cleanr&d
##  Min.   :0.000000           Min.   :0.000000   Min.   :0.000000   
##  1st Qu.:0.000000           1st Qu.:0.000000   1st Qu.:0.000000   
##  Median :0.000000           Median :0.000000   Median :0.000000   
##  Mean   :0.001846           Mean   :0.001007   Mean   :0.003076   
##  3rd Qu.:0.000000           3rd Qu.:0.000000   3rd Qu.:0.000000   
##  Max.   :1.000000           Max.   :1.000000   Max.   :1.000000   
##  department_cleanretail department_cleansales department_cleansquiz
##  Min.   :0.000000       Min.   :0.00000       Min.   :0.000000     
##  1st Qu.:0.000000       1st Qu.:0.00000       1st Qu.:0.000000     
##  Median :0.000000       Median :0.00000       Median :0.000000     
##  Mean   :0.002573       Mean   :0.03322       Mean   :0.001119     
##  3rd Qu.:0.000000       3rd Qu.:0.00000       3rd Qu.:0.000000     
##  Max.   :1.000000       Max.   :1.00000       Max.   :1.000000     
##  department_cleansupport department_cleantech_development
##  Min.   :0.000000        Min.   :0.00000                 
##  1st Qu.:0.000000        1st Qu.:0.00000                 
##  Median :0.000000        Median :0.00000                 
##  Mean   :0.001063        Mean   :0.02109                 
##  3rd Qu.:0.000000        3rd Qu.:0.00000                 
##  Max.   :1.000000        Max.   :1.00000                 
##  department_cleantechnology industry_cleanBusiness Administration
##  Min.   :0.000000           Min.   :0.00000                      
##  1st Qu.:0.000000           1st Qu.:0.00000                      
##  Median :0.000000           Median :0.00000                      
##  Mean   :0.004418           Mean   :0.01359                      
##  3rd Qu.:0.000000           3rd Qu.:0.00000                      
##  Max.   :1.000000           Max.   :1.00000                      
##  industry_cleanConsulting, Professional Services & Legal
##  Min.   :0.00000                                        
##  1st Qu.:0.00000                                        
##  Median :0.00000                                        
##  Mean   :0.01493                                        
##  3rd Qu.:0.00000                                        
##  Max.   :1.00000                                        
##  industry_cleanConsumer Goods, Retail & Fashion
##  Min.   :0.00000                               
##  1st Qu.:0.00000                               
##  Median :0.00000                               
##  Mean   :0.04972                               
##  3rd Qu.:0.00000                               
##  Max.   :1.00000                               
##  industry_cleanDefense, Security & Aerospace industry_cleanEducation & Training
##  Min.   :0.000000                            Min.   :0.00000                   
##  1st Qu.:0.000000                            1st Qu.:0.00000                   
##  Median :0.000000                            Median :0.00000                   
##  Mean   :0.003635                            Mean   :0.05783                   
##  3rd Qu.:0.000000                            3rd Qu.:0.00000                   
##  Max.   :1.000000                            Max.   :1.00000                   
##  industry_cleanEnergy, Utilities & Environment
##  Min.   :0.00000                              
##  1st Qu.:0.00000                              
##  Median :0.00000                              
##  Mean   :0.04066                              
##  3rd Qu.:0.00000                              
##  Max.   :1.00000                              
##  industry_cleanFinance, Banking & Insurance
##  Min.   :0.0000                            
##  1st Qu.:0.0000                            
##  Median :0.0000                            
##  Mean   :0.0665                            
##  3rd Qu.:0.0000                            
##  Max.   :1.0000                            
##  industry_cleanGovernment, Nonprofit & Public Sector
##  Min.   :0.00000                                    
##  1st Qu.:0.00000                                    
##  Median :0.00000                                    
##  Mean   :0.01113                                    
##  3rd Qu.:0.00000                                    
##  Max.   :1.00000                                    
##  industry_cleanHealthcare, Wellness & Life Sciences
##  Min.   :0.00000                                   
##  1st Qu.:0.00000                                   
##  Median :0.00000                                   
##  Mean   :0.04553                                   
##  3rd Qu.:0.00000                                   
##  Max.   :1.00000                                   
##  industry_cleanHospitality, Travel & Leisure
##  Min.   :0.00000                            
##  1st Qu.:0.00000                            
##  Median :0.00000                            
##  Mean   :0.02601                            
##  3rd Qu.:0.00000                            
##  Max.   :1.00000                            
##  industry_cleanManufacturing & Industrial
##  Min.   :0.00000                         
##  1st Qu.:0.00000                         
##  Median :0.00000                         
##  Mean   :0.01879                         
##  3rd Qu.:0.00000                         
##  Max.   :1.00000                         
##  industry_cleanMedia, Entertainment & Creative industry_cleanNA
##  Min.   :0.00000                               Min.   :0.0000  
##  1st Qu.:0.00000                               1st Qu.:0.0000  
##  Median :0.00000                               Median :0.0000  
##  Mean   :0.08311                               Mean   :0.2742  
##  3rd Qu.:0.00000                               3rd Qu.:1.0000  
##  Max.   :1.00000                               Max.   :1.0000  
##  industry_cleanReal Estate & Construction industry_cleanTechnology & Software
##  Min.   :0.00000                          Min.   :0.0000                     
##  1st Qu.:0.00000                          1st Qu.:0.0000                     
##  Median :0.00000                          Median :0.0000                     
##  Mean   :0.02377                          Mean   :0.2481                     
##  3rd Qu.:0.00000                          3rd Qu.:0.0000                     
##  Max.   :1.00000                          Max.   :1.0000                     
##  industry_cleanTransportation, Logistics & Supply Chain function_cleanArts
##  Min.   :0.00000                                        Min.   :0.00000   
##  1st Qu.:0.00000                                        1st Qu.:0.00000   
##  Median :0.00000                                        Median :0.00000   
##  Mean   :0.01432                                        Mean   :0.03378   
##  3rd Qu.:0.00000                                        3rd Qu.:0.00000   
##  Max.   :1.00000                                        Max.   :1.00000   
##  function_cleanEducation function_cleanEngineering & Production
##  Min.   :0.00000         Min.   :0.0000                        
##  1st Qu.:0.00000         1st Qu.:0.0000                        
##  Median :0.00000         Median :0.0000                        
##  Mean   :0.01818         Mean   :0.1026                        
##  3rd Qu.:0.00000         3rd Qu.:0.0000                        
##  Max.   :1.00000         Max.   :1.0000                        
##  function_cleanFinance & Accounting function_cleanHealthcare & Science
##  Min.   :0.00000                    Min.   :0.00000                   
##  1st Qu.:0.00000                    1st Qu.:0.00000                   
##  Median :0.00000                    Median :0.00000                   
##  Mean   :0.02332                    Mean   :0.01969                   
##  3rd Qu.:0.00000                    3rd Qu.:0.00000                   
##  Max.   :1.00000                    Max.   :1.00000                   
##  function_cleanHuman Resources & Training function_cleanLegal & Compliance
##  Min.   :0.00000                          Min.   :0.000000                
##  1st Qu.:0.00000                          1st Qu.:0.000000                
##  Median :0.00000                          Median :0.000000                
##  Mean   :0.02164                          Mean   :0.008837                
##  3rd Qu.:0.00000                          3rd Qu.:0.000000                
##  Max.   :1.00000                          Max.   :1.000000                
##  function_cleanManagement & Leadership function_cleanMarketing & Advertising
##  Min.   :0.00000                       Min.   :0.0000                       
##  1st Qu.:0.00000                       1st Qu.:0.0000                       
##  Median :0.00000                       Median :0.0000                       
##  Mean   :0.05934                       Mean   :0.0557                       
##  3rd Qu.:0.00000                       3rd Qu.:0.0000                       
##  Max.   :1.00000                       Max.   :1.0000                       
##  function_cleanNA function_cleanOther function_cleanResearch
##  Min.   :0.000    Min.   :0.00000     Min.   :0.000000      
##  1st Qu.:0.000    1st Qu.:0.00000     1st Qu.:0.000000      
##  Median :0.000    Median :0.00000     Median :0.000000      
##  Mean   :0.361    Mean   :0.01818     Mean   :0.002796      
##  3rd Qu.:1.000    3rd Qu.:0.00000     3rd Qu.:0.000000      
##  Max.   :1.000    Max.   :1.00000     Max.   :1.000000      
##  function_cleanSales & Customer Service & IT
##  Min.   :0.0000                             
##  1st Qu.:0.0000                             
##  Median :0.0000                             
##  Mean   :0.2487                             
##  3rd Qu.:0.0000                             
##  Max.   :1.0000                             
##  function_cleanSupply Chain & Logistics loc_country_newAT  loc_country_newAU
##  Min.   :0.000000                       Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000                       1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000                       Median :0.000000   Median :0.00000  
##  Mean   :0.004195                       Mean   :0.000783   Mean   :0.01197  
##  3rd Qu.:0.000000                       3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000                       Max.   :1.000000   Max.   :1.00000  
##  loc_country_newBE  loc_country_newBG   loc_country_newBR  loc_country_newCA
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.0000000   Median :0.000000   Median :0.00000  
##  Mean   :0.006544   Mean   :0.0009508   Mean   :0.002013   Mean   :0.02556  
##  3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.0000000   Max.   :1.000000   Max.   :1.00000  
##  loc_country_newCH   loc_country_newCN   loc_country_newCY   loc_country_newDE
##  Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000   Min.   :0.00000  
##  1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.00000  
##  Median :0.0000000   Median :0.0000000   Median :0.0000000   Median :0.00000  
##  Mean   :0.0008389   Mean   :0.0008389   Mean   :0.0006152   Mean   :0.02142  
##  3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.00000  
##  Max.   :1.0000000   Max.   :1.0000000   Max.   :1.0000000   Max.   :1.00000  
##  loc_country_newDK  loc_country_newEE  loc_country_newEG  loc_country_newES 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.000000   Median :0.000000  
##  Mean   :0.002349   Mean   :0.004027   Mean   :0.002908   Mean   :0.003691  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newFI  loc_country_newFR  loc_country_newGB loc_country_newGR
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.0000    Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000    1st Qu.:0.00000  
##  Median :0.000000   Median :0.000000   Median :0.0000    Median :0.00000  
##  Mean   :0.001622   Mean   :0.003915   Mean   :0.1333    Mean   :0.05257  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.0000    3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.0000    Max.   :1.00000  
##  loc_country_newHK  loc_country_newHU  loc_country_newID   loc_country_newIE 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.0000000   Median :0.000000  
##  Mean   :0.004306   Mean   :0.000783   Mean   :0.0007271   Mean   :0.006376  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.0000000   Max.   :1.000000  
##  loc_country_newIL  loc_country_newIN loc_country_newIT  loc_country_newJP 
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.00000   Median :0.000000   Median :0.000000  
##  Mean   :0.004027   Mean   :0.01544   Mean   :0.001734   Mean   :0.001119  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newLT  loc_country_newMT   loc_country_newMU  loc_country_newMX 
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.0000000   Median :0.000000   Median :0.000000  
##  Mean   :0.001286   Mean   :0.0007271   Mean   :0.000783   Mean   :0.001007  
##  3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.0000000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newMY  loc_country_newNA loc_country_newNL  loc_country_newNZ
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.001174   Mean   :0.01935   Mean   :0.007103   Mean   :0.01862  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##  loc_country_newOther loc_country_newPH  loc_country_newPK loc_country_newPL 
##  Min.   :0.000000     Min.   :0.000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000000     1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000000     Median :0.000000   Median :0.00000   Median :0.000000  
##  Mean   :0.009508     Mean   :0.007383   Mean   :0.00151   Mean   :0.004251  
##  3rd Qu.:0.000000     3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.000000     Max.   :1.000000   Max.   :1.00000   Max.   :1.000000  
##  loc_country_newPT  loc_country_newQA  loc_country_newRO  loc_country_newRU 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.000000   Median :0.000000  
##  Mean   :0.001007   Mean   :0.001174   Mean   :0.002573   Mean   :0.001119  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newSA   loc_country_newSE loc_country_newSG  loc_country_newTR  
##  Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000000  
##  1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000000  
##  Median :0.0000000   Median :0.00000   Median :0.000000   Median :0.0000000  
##  Mean   :0.0008389   Mean   :0.00274   Mean   :0.004474   Mean   :0.0009508  
##  3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000000  
##  Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000000  
##  loc_country_newUA   loc_country_newUS loc_country_newZA 
##  Min.   :0.0000000   Min.   :0.000     Min.   :0.000000  
##  1st Qu.:0.0000000   1st Qu.:0.000     1st Qu.:0.000000  
##  Median :0.0000000   Median :1.000     Median :0.000000  
##  Mean   :0.0007271   Mean   :0.596     Mean   :0.002237  
##  3rd Qu.:0.0000000   3rd Qu.:1.000     3rd Qu.:0.000000  
##  Max.   :1.0000000   Max.   :1.000     Max.   :1.000000
# Data for KNN and ANN models
str(job_scaled)
## 'data.frame':    17880 obs. of  187 variables:
##  $ title                                                  : num  0.0935 0.2734 0.259 0.2158 0.1151 ...
##  $ company_profile                                        : num  0.1433 0.2082 0.1423 0.0994 0.2635 ...
##  $ description                                            : num  0.0605 0.1392 0.0236 0.1742 0.1018 ...
##  $ requirements                                           : num  0.0784 0.1319 0.1255 0.1315 0.0697 ...
##  $ benefits                                               : num  0 0.29171 0 0.17656 0.00474 ...
##  $ telecommuting                                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_logo                                       : num  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_questions                                          : num  0 0 0 0 1 0 1 1 1 0 ...
##  $ employment_type                                        : num  0 0 1 0 0 1 0 1 0 0 ...
##  $ employment_typeContract                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ employment_typeFull.time                               : num  0 1 0 1 1 0 1 0 1 0 ...
##  $ employment_typeOther                                   : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ employment_typePart.time                               : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ employment_typeTemporary                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceAssociate                           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ required_experienceDirector                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceEntry.level                         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ required_experienceExecutive                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceInternship                          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ required_experienceMid.Senior.level                    : num  0 0 0 1 1 0 1 0 0 0 ...
##  $ required_experienceNot.Applicable                      : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ required_educationAssociate.Degree                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationBachelor.s.Degree                    : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ required_educationCertification                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationDoctorate                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationHigh.School.or.equivalent            : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ required_educationMaster.s.Degree                      : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ required_educationProfessional                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationSome.College.Coursework.Completed    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationSome.High.School.Coursework          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationUnspecified                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationVocational                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationVocational...Degree                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ required_educationVocational...HS.Diploma              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fraudulent                                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_pipe                                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_hash                                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_bonus                                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ benefits_apply                                         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ benefits_benefits                                      : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ slash_present                                          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ backslash_present                                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amp_present                                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclam_present                                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dash_present                                           : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ multiple_spaces                                        : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ parens_present                                         : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ numbers_present                                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_missing_or_short                                   : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_heavy_engineering_terms                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_certification_terms                                : num  1 1 1 1 1 0 1 1 1 1 ...
##  $ has_years_experience                                   : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_degree_required                                    : num  1 0 1 1 1 0 1 0 0 1 ...
##  $ has_tool_software_terms                                : num  1 1 0 1 0 0 0 0 0 1 ...
##  $ has_safety_regulation_terms                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_contains_heavy_lists                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ req_title_mismatch                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_urgent_language                                    : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ has_no_experience_needed                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_salary_info                                        : num  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_qualification_terms                                : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ has_benefits_stated                                    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_technical_terms                                    : num  0 1 0 1 1 1 0 1 0 0 ...
##  $ has_contact_number_or_whatsapp                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_company_language                                   : num  1 1 1 1 1 1 1 1 0 1 ...
##  $ has_commission_only_language                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_referral_bonus                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_signing_bonus                                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_perks                                              : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_relocation                                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ salary_known1                                          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ department_cleanaccounting                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanadministration                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanall                                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanart.studio                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanbusiness_management                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanclerical                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancommercial                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancreative                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancustomer.service                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleancustomer_facing                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleandepartment                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleandigital                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleaneducation_training                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanengagement                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanengineering                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanfinance                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanhr                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleaninternational.growth                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanit                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanlegal                                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanmarketing                              : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanmerchandising                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanNA                                     : num  0 0 1 0 1 1 0 1 1 1 ...
##  $ department_cleanoperations                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanoperations_logistics                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanother                                  : num  0 1 0 0 0 0 1 0 0 0 ...
##  $ department_cleanpermanent                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ department_cleanproduct                                : num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
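The `str()` output above shows every column of `job_scaled` as numeric and within [0, 1], which is what distance-based models such as KNN and gradient-trained models such as ANN generally need. The scaling code itself is not shown in this part of the report, so the following is only a minimal sketch of one common way to get such a range (min-max normalization). The helper `min_max` and the toy data frame `toy` are hypothetical names used for illustration; they are not the author's objects.

```r
# Hypothetical helper (an assumption, not taken from the report):
# rescale a numeric vector to the [0, 1] range.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would give 0/0, so map it to all zeros instead.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy data standing in for a dummy-encoded data set (not the real job data).
toy <- data.frame(
  description      = c(120, 980, 45, 3000),  # a raw numeric feature
  has_company_logo = c(1, 1, 0, 1)           # a 0/1 indicator stays 0/1
)

# Apply the helper column by column and rebuild the data frame.
toy_scaled <- as.data.frame(lapply(toy, min_max))
summary(toy_scaled)
```

With this approach, 0/1 dummy columns are left unchanged, while continuous columns are compressed so that no single feature dominates a distance calculation.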
summary(job_scaled)
##      title        company_profile    description       requirements    
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.1151   1st Qu.:0.02234   1st Qu.:0.04053   1st Qu.:0.01344  
##  Median :0.1583   Median :0.09226   Median :0.06804   Median :0.04299  
##  Mean   :0.1837   Mean   :0.10050   Mean   :0.08152   Mean   :0.05432  
##  3rd Qu.:0.2302   3rd Qu.:0.14228   3rd Qu.:0.10621   3rd Qu.:0.07548  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##     benefits       telecommuting    has_company_logo has_questions   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :0.01016   Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.04717   Mean   :0.0429   Mean   :0.7953   Mean   :0.4917  
##  3rd Qu.:0.06638   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  employment_type  employment_typeContract employment_typeFull.time
##  Min.   :0.0000   Min.   :0.00000         Min.   :0.0000          
##  1st Qu.:0.0000   1st Qu.:0.00000         1st Qu.:0.0000          
##  Median :0.0000   Median :0.00000         Median :1.0000          
##  Mean   :0.1941   Mean   :0.08523         Mean   :0.6499          
##  3rd Qu.:0.0000   3rd Qu.:0.00000         3rd Qu.:1.0000          
##  Max.   :1.0000   Max.   :1.00000         Max.   :1.0000          
##  employment_typeOther employment_typePart.time employment_typeTemporary
##  Min.   :0.0000       Min.   :0.00000          Min.   :0.00000         
##  1st Qu.:0.0000       1st Qu.:0.00000          1st Qu.:0.00000         
##  Median :0.0000       Median :0.00000          Median :0.00000         
##  Mean   :0.0127       Mean   :0.04457          Mean   :0.01348         
##  3rd Qu.:0.0000       3rd Qu.:0.00000          3rd Qu.:0.00000         
##  Max.   :1.0000       Max.   :1.00000          Max.   :1.00000         
##  required_experienceAssociate required_experienceDirector
##  Min.   :0.0000               Min.   :0.00000            
##  1st Qu.:0.0000               1st Qu.:0.00000            
##  Median :0.0000               Median :0.00000            
##  Mean   :0.1285               Mean   :0.02176            
##  3rd Qu.:0.0000               3rd Qu.:0.00000            
##  Max.   :1.0000               Max.   :1.00000            
##  required_experienceEntry.level required_experienceExecutive
##  Min.   :0.0000                 Min.   :0.000000            
##  1st Qu.:0.0000                 1st Qu.:0.000000            
##  Median :0.0000                 Median :0.000000            
##  Mean   :0.1508                 Mean   :0.007886            
##  3rd Qu.:0.0000                 3rd Qu.:0.000000            
##  Max.   :1.0000                 Max.   :1.000000            
##  required_experienceInternship required_experienceMid.Senior.level
##  Min.   :0.00000               Min.   :0.000                      
##  1st Qu.:0.00000               1st Qu.:0.000                      
##  Median :0.00000               Median :0.000                      
##  Mean   :0.02131               Mean   :0.213                      
##  3rd Qu.:0.00000               3rd Qu.:0.000                      
##  Max.   :1.00000               Max.   :1.000                      
##  required_experienceNot.Applicable required_educationAssociate.Degree
##  Min.   :0.00000                   Min.   :0.00000                   
##  1st Qu.:0.00000                   1st Qu.:0.00000                   
##  Median :0.00000                   Median :0.00000                   
##  Mean   :0.06242                   Mean   :0.01532                   
##  3rd Qu.:0.00000                   3rd Qu.:0.00000                   
##  Max.   :1.00000                   Max.   :1.00000                   
##  required_educationBachelor.s.Degree required_educationCertification
##  Min.   :0.0000                      Min.   :0.000000               
##  1st Qu.:0.0000                      1st Qu.:0.000000               
##  Median :0.0000                      Median :0.000000               
##  Mean   :0.2878                      Mean   :0.009508               
##  3rd Qu.:1.0000                      3rd Qu.:0.000000               
##  Max.   :1.0000                      Max.   :1.000000               
##  required_educationDoctorate required_educationHigh.School.or.equivalent
##  Min.   :0.000000            Min.   :0.0000                             
##  1st Qu.:0.000000            1st Qu.:0.0000                             
##  Median :0.000000            Median :0.0000                             
##  Mean   :0.001454            Mean   :0.1163                             
##  3rd Qu.:0.000000            3rd Qu.:0.0000                             
##  Max.   :1.000000            Max.   :1.0000                             
##  required_educationMaster.s.Degree required_educationProfessional
##  Min.   :0.00000                   Min.   :0.000000              
##  1st Qu.:0.00000                   1st Qu.:0.000000              
##  Median :0.00000                   Median :0.000000              
##  Mean   :0.02327                   Mean   :0.004139              
##  3rd Qu.:0.00000                   3rd Qu.:0.000000              
##  Max.   :1.00000                   Max.   :1.000000              
##  required_educationSome.College.Coursework.Completed
##  Min.   :0.000000                                   
##  1st Qu.:0.000000                                   
##  Median :0.000000                                   
##  Mean   :0.005705                                   
##  3rd Qu.:0.000000                                   
##  Max.   :1.000000                                   
##  required_educationSome.High.School.Coursework required_educationUnspecified
##  Min.   :0.00000                               Min.   :0.00000              
##  1st Qu.:0.00000                               1st Qu.:0.00000              
##  Median :0.00000                               Median :0.00000              
##  Mean   :0.00151                               Mean   :0.07813              
##  3rd Qu.:0.00000                               3rd Qu.:0.00000              
##  Max.   :1.00000                               Max.   :1.00000              
##  required_educationVocational required_educationVocational...Degree
##  Min.   :0.00000              Min.   :0.0000000                    
##  1st Qu.:0.00000              1st Qu.:0.0000000                    
##  Median :0.00000              Median :0.0000000                    
##  Mean   :0.00274              Mean   :0.0003356                    
##  3rd Qu.:0.00000              3rd Qu.:0.0000000                    
##  Max.   :1.00000              Max.   :1.0000000                    
##  required_educationVocational...HS.Diploma   fraudulent      benefits_pipe     
##  Min.   :0.0000000                         Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.0000000                         1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.0000000                         Median :0.00000   Median :0.000000  
##  Mean   :0.0005034                         Mean   :0.04843   Mean   :0.001566  
##  3rd Qu.:0.0000000                         3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.0000000                         Max.   :1.00000   Max.   :1.000000  
##  benefits_hash     benefits_bonus    benefits_apply    benefits_benefits
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.0000   
##  Mean   :0.05543   Mean   :0.07131   Mean   :0.04234   Mean   :0.2012   
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   
##  slash_present     backslash_present    amp_present      exclam_present   
##  Min.   :0.00000   Min.   :0.0000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000000   Median :0.00000   Median :0.00000  
##  Mean   :0.09659   Mean   :0.0001119   Mean   :0.03356   Mean   :0.01102  
##  3rd Qu.:0.00000   3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000000   Max.   :1.00000   Max.   :1.00000  
##   dash_present   multiple_spaces    parens_present    numbers_present  
##  Min.   :0.000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.169   Mean   :0.009228   Mean   :0.08853   Mean   :0.04787  
##  3rd Qu.:0.000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
##  req_missing_or_short has_heavy_engineering_terms has_certification_terms
##  Min.   :0.0000       Min.   :0.00000             Min.   :0.0000         
##  1st Qu.:0.0000       1st Qu.:0.00000             1st Qu.:1.0000         
##  Median :0.0000       Median :0.00000             Median :1.0000         
##  Mean   :0.1763       Mean   :0.06549             Mean   :0.7724         
##  3rd Qu.:0.0000       3rd Qu.:0.00000             3rd Qu.:1.0000         
##  Max.   :1.0000       Max.   :1.00000             Max.   :1.0000         
##  has_years_experience has_degree_required has_tool_software_terms
##  Min.   :0.0000       Min.   :0.0000      Min.   :0.0000         
##  1st Qu.:0.0000       1st Qu.:0.0000      1st Qu.:0.0000         
##  Median :0.0000       Median :0.0000      Median :0.0000         
##  Mean   :0.3444       Mean   :0.4238      Mean   :0.3497         
##  3rd Qu.:1.0000       3rd Qu.:1.0000      3rd Qu.:1.0000         
##  Max.   :1.0000       Max.   :1.0000      Max.   :1.0000         
##  has_safety_regulation_terms req_contains_heavy_lists req_title_mismatch
##  Min.   :0.00000             Min.   :0.00000          Min.   :0.00000   
##  1st Qu.:0.00000             1st Qu.:0.00000          1st Qu.:0.00000   
##  Median :0.00000             Median :0.00000          Median :0.00000   
##  Mean   :0.03216             Mean   :0.02136          Mean   :0.06549   
##  3rd Qu.:0.00000             3rd Qu.:0.00000          3rd Qu.:0.00000   
##  Max.   :1.00000             Max.   :1.00000          Max.   :1.00000   
##  has_urgent_language has_no_experience_needed has_salary_info 
##  Min.   :0.00000     Min.   :0.000000         Min.   :0.0000  
##  1st Qu.:0.00000     1st Qu.:0.000000         1st Qu.:1.0000  
##  Median :0.00000     Median :0.000000         Median :1.0000  
##  Mean   :0.07959     Mean   :0.007159         Mean   :0.9705  
##  3rd Qu.:0.00000     3rd Qu.:0.000000         3rd Qu.:1.0000  
##  Max.   :1.00000     Max.   :1.000000         Max.   :1.0000  
##  has_qualification_terms has_benefits_stated has_technical_terms
##  Min.   :0.0000          Min.   :0.00000     Min.   :0.0000     
##  1st Qu.:0.0000          1st Qu.:0.00000     1st Qu.:0.0000     
##  Median :0.0000          Median :0.00000     Median :0.0000     
##  Mean   :0.1006          Mean   :0.09077     Mean   :0.2698     
##  3rd Qu.:0.0000          3rd Qu.:0.00000     3rd Qu.:1.0000     
##  Max.   :1.0000          Max.   :1.00000     Max.   :1.0000     
##  has_contact_number_or_whatsapp has_company_language
##  Min.   :0.00000                Min.   :0.0000      
##  1st Qu.:0.00000                1st Qu.:0.0000      
##  Median :0.00000                Median :1.0000      
##  Mean   :0.00179                Mean   :0.6383      
##  3rd Qu.:0.00000                3rd Qu.:1.0000      
##  Max.   :1.00000                Max.   :1.0000      
##  has_commission_only_language has_referral_bonus has_signing_bonus 
##  Min.   :0.00000              Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.00000              1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.00000              Median :0.000000   Median :0.000000  
##  Mean   :0.00453              Mean   :0.006432   Mean   :0.003132  
##  3rd Qu.:0.00000              3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.00000              Max.   :1.000000   Max.   :1.000000  
##    has_perks       has_relocation     salary_known1   
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   :0.05872   Mean   :0.009955   Mean   :0.1604  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##  department_cleanaccounting department_cleanadministration department_cleanall
##  Min.   :0.000000           Min.   :0.000000               Min.   :0.0000000  
##  1st Qu.:0.000000           1st Qu.:0.000000               1st Qu.:0.0000000  
##  Median :0.000000           Median :0.000000               Median :0.0000000  
##  Mean   :0.002685           Mean   :0.005313               Mean   :0.0008948  
##  3rd Qu.:0.000000           3rd Qu.:0.000000               3rd Qu.:0.0000000  
##  Max.   :1.000000           Max.   :1.000000               Max.   :1.0000000  
##  department_cleanart.studio department_cleanbusiness_management
##  Min.   :0.0000000          Min.   :0.000000                   
##  1st Qu.:0.0000000          1st Qu.:0.000000                   
##  Median :0.0000000          Median :0.000000                   
##  Mean   :0.0006152          Mean   :0.004586                   
##  3rd Qu.:0.0000000          3rd Qu.:0.000000                   
##  Max.   :1.0000000          Max.   :1.000000                   
##  department_cleanclerical department_cleancommercial department_cleancreative
##  Min.   :0.00000          Min.   :0.000000           Min.   :0.000000        
##  1st Qu.:0.00000          1st Qu.:0.000000           1st Qu.:0.000000        
##  Median :0.00000          Median :0.000000           Median :0.000000        
##  Mean   :0.00151          Mean   :0.001007           Mean   :0.002685        
##  3rd Qu.:0.00000          3rd Qu.:0.000000           3rd Qu.:0.000000        
##  Max.   :1.00000          Max.   :1.000000           Max.   :1.000000        
##  department_cleancustomer.service department_cleancustomer_facing
##  Min.   :0.00000                  Min.   :0.000000               
##  1st Qu.:0.00000                  1st Qu.:0.000000               
##  Median :0.00000                  Median :0.000000               
##  Mean   :0.00755                  Mean   :0.006655               
##  3rd Qu.:0.00000                  3rd Qu.:0.000000               
##  Max.   :1.00000                  Max.   :1.000000               
##  department_cleandepartment department_cleandigital
##  Min.   :0.000000           Min.   :0.000000       
##  1st Qu.:0.000000           1st Qu.:0.000000       
##  Median :0.000000           Median :0.000000       
##  Mean   :0.001286           Mean   :0.000783       
##  3rd Qu.:0.000000           3rd Qu.:0.000000       
##  Max.   :1.000000           Max.   :1.000000       
##  department_cleaneducation_training department_cleanengagement
##  Min.   :0.000000                   Min.   :0.0000000         
##  1st Qu.:0.000000                   1st Qu.:0.0000000         
##  Median :0.000000                   Median :0.0000000         
##  Mean   :0.003244                   Mean   :0.0007271         
##  3rd Qu.:0.000000                   3rd Qu.:0.0000000         
##  Max.   :1.000000                   Max.   :1.0000000         
##  department_cleanengineering department_cleanfinance department_cleanhr
##  Min.   :0.00000             Min.   :0.000000        Min.   :0.00000   
##  1st Qu.:0.00000             1st Qu.:0.000000        1st Qu.:0.00000   
##  Median :0.00000             Median :0.000000        Median :0.00000   
##  Mean   :0.02864             Mean   :0.004139        Mean   :0.00481   
##  3rd Qu.:0.00000             3rd Qu.:0.000000        3rd Qu.:0.00000   
##  Max.   :1.00000             Max.   :1.000000        Max.   :1.00000   
##  department_cleaninternational.growth department_cleanit department_cleanlegal
##  Min.   :0.0000000                    Min.   :0.00000    Min.   :0.000000     
##  1st Qu.:0.0000000                    1st Qu.:0.00000    1st Qu.:0.000000     
##  Median :0.0000000                    Median :0.00000    Median :0.000000     
##  Mean   :0.0009508                    Mean   :0.01985    Mean   :0.001342     
##  3rd Qu.:0.0000000                    3rd Qu.:0.00000    3rd Qu.:0.000000     
##  Max.   :1.0000000                    Max.   :1.00000    Max.   :1.000000     
##  department_cleanmarketing department_cleanmerchandising department_cleanNA
##  Min.   :0.00000           Min.   :0.0000000             Min.   :0.0000    
##  1st Qu.:0.00000           1st Qu.:0.0000000             1st Qu.:0.0000    
##  Median :0.00000           Median :0.0000000             Median :1.0000    
##  Mean   :0.02478           Mean   :0.0006152             Mean   :0.6461    
##  3rd Qu.:0.00000           3rd Qu.:0.0000000             3rd Qu.:1.0000    
##  Max.   :1.00000           Max.   :1.0000000             Max.   :1.0000    
##  department_cleanoperations department_cleanoperations_logistics
##  Min.   :0.00000            Min.   :0.000000                    
##  1st Qu.:0.00000            1st Qu.:0.000000                    
##  Median :0.00000            Median :0.000000                    
##  Mean   :0.02047            Mean   :0.001566                    
##  3rd Qu.:0.00000            3rd Qu.:0.000000                    
##  Max.   :1.00000            Max.   :1.000000                    
##  department_cleanother department_cleanpermanent department_cleanproduct
##  Min.   :0.000         Min.   :0.0000000         Min.   :0.00000        
##  1st Qu.:0.000         1st Qu.:0.0000000         1st Qu.:0.00000        
##  Median :0.000         Median :0.0000000         Median :0.00000        
##  Mean   :0.126         Mean   :0.0007271         Mean   :0.01035        
##  3rd Qu.:0.000         3rd Qu.:0.0000000         3rd Qu.:0.00000        
##  Max.   :1.000         Max.   :1.0000000         Max.   :1.00000        
##  department_cleanproduction department_cleanqa department_cleanr.d
##  Min.   :0.000000           Min.   :0.000000   Min.   :0.000000   
##  1st Qu.:0.000000           1st Qu.:0.000000   1st Qu.:0.000000   
##  Median :0.000000           Median :0.000000   Median :0.000000   
##  Mean   :0.001846           Mean   :0.001007   Mean   :0.003076   
##  3rd Qu.:0.000000           3rd Qu.:0.000000   3rd Qu.:0.000000   
##  Max.   :1.000000           Max.   :1.000000   Max.   :1.000000   
##  department_cleanretail department_cleansales department_cleansquiz
##  Min.   :0.000000       Min.   :0.00000       Min.   :0.000000     
##  1st Qu.:0.000000       1st Qu.:0.00000       1st Qu.:0.000000     
##  Median :0.000000       Median :0.00000       Median :0.000000     
##  Mean   :0.002573       Mean   :0.03322       Mean   :0.001119     
##  3rd Qu.:0.000000       3rd Qu.:0.00000       3rd Qu.:0.000000     
##  Max.   :1.000000       Max.   :1.00000       Max.   :1.000000     
##  department_cleansupport department_cleantech_development
##  Min.   :0.000000        Min.   :0.00000                 
##  1st Qu.:0.000000        1st Qu.:0.00000                 
##  Median :0.000000        Median :0.00000                 
##  Mean   :0.001063        Mean   :0.02109                 
##  3rd Qu.:0.000000        3rd Qu.:0.00000                 
##  Max.   :1.000000        Max.   :1.00000                 
##  department_cleantechnology industry_cleanBusiness.Administration
##  Min.   :0.000000           Min.   :0.00000                      
##  1st Qu.:0.000000           1st Qu.:0.00000                      
##  Median :0.000000           Median :0.00000                      
##  Mean   :0.004418           Mean   :0.01359                      
##  3rd Qu.:0.000000           3rd Qu.:0.00000                      
##  Max.   :1.000000           Max.   :1.00000                      
##  industry_cleanConsulting..Professional.Services...Legal
##  Min.   :0.00000                                        
##  1st Qu.:0.00000                                        
##  Median :0.00000                                        
##  Mean   :0.01493                                        
##  3rd Qu.:0.00000                                        
##  Max.   :1.00000                                        
##  industry_cleanConsumer.Goods..Retail...Fashion
##  Min.   :0.00000                               
##  1st Qu.:0.00000                               
##  Median :0.00000                               
##  Mean   :0.04972                               
##  3rd Qu.:0.00000                               
##  Max.   :1.00000                               
##  industry_cleanDefense..Security...Aerospace industry_cleanEducation...Training
##  Min.   :0.000000                            Min.   :0.00000                   
##  1st Qu.:0.000000                            1st Qu.:0.00000                   
##  Median :0.000000                            Median :0.00000                   
##  Mean   :0.003635                            Mean   :0.05783                   
##  3rd Qu.:0.000000                            3rd Qu.:0.00000                   
##  Max.   :1.000000                            Max.   :1.00000                   
##  industry_cleanEnergy..Utilities...Environment
##  Min.   :0.00000                              
##  1st Qu.:0.00000                              
##  Median :0.00000                              
##  Mean   :0.04066                              
##  3rd Qu.:0.00000                              
##  Max.   :1.00000                              
##  industry_cleanFinance..Banking...Insurance
##  Min.   :0.0000                            
##  1st Qu.:0.0000                            
##  Median :0.0000                            
##  Mean   :0.0665                            
##  3rd Qu.:0.0000                            
##  Max.   :1.0000                            
##  industry_cleanGovernment..Nonprofit...Public.Sector
##  Min.   :0.00000                                    
##  1st Qu.:0.00000                                    
##  Median :0.00000                                    
##  Mean   :0.01113                                    
##  3rd Qu.:0.00000                                    
##  Max.   :1.00000                                    
##  industry_cleanHealthcare..Wellness...Life.Sciences
##  Min.   :0.00000                                   
##  1st Qu.:0.00000                                   
##  Median :0.00000                                   
##  Mean   :0.04553                                   
##  3rd Qu.:0.00000                                   
##  Max.   :1.00000                                   
##  industry_cleanHospitality..Travel...Leisure
##  Min.   :0.00000                            
##  1st Qu.:0.00000                            
##  Median :0.00000                            
##  Mean   :0.02601                            
##  3rd Qu.:0.00000                            
##  Max.   :1.00000                            
##  industry_cleanManufacturing...Industrial
##  Min.   :0.00000                         
##  1st Qu.:0.00000                         
##  Median :0.00000                         
##  Mean   :0.01879                         
##  3rd Qu.:0.00000                         
##  Max.   :1.00000                         
##  industry_cleanMedia..Entertainment...Creative industry_cleanNA
##  Min.   :0.00000                               Min.   :0.0000  
##  1st Qu.:0.00000                               1st Qu.:0.0000  
##  Median :0.00000                               Median :0.0000  
##  Mean   :0.08311                               Mean   :0.2742  
##  3rd Qu.:0.00000                               3rd Qu.:1.0000  
##  Max.   :1.00000                               Max.   :1.0000  
##  industry_cleanReal.Estate...Construction industry_cleanTechnology...Software
##  Min.   :0.00000                          Min.   :0.0000                     
##  1st Qu.:0.00000                          1st Qu.:0.0000                     
##  Median :0.00000                          Median :0.0000                     
##  Mean   :0.02377                          Mean   :0.2481                     
##  3rd Qu.:0.00000                          3rd Qu.:0.0000                     
##  Max.   :1.00000                          Max.   :1.0000                     
##  industry_cleanTransportation..Logistics...Supply.Chain function_cleanArts
##  Min.   :0.00000                                        Min.   :0.00000   
##  1st Qu.:0.00000                                        1st Qu.:0.00000   
##  Median :0.00000                                        Median :0.00000   
##  Mean   :0.01432                                        Mean   :0.03378   
##  3rd Qu.:0.00000                                        3rd Qu.:0.00000   
##  Max.   :1.00000                                        Max.   :1.00000   
##  function_cleanEducation function_cleanEngineering...Production
##  Min.   :0.00000         Min.   :0.0000                        
##  1st Qu.:0.00000         1st Qu.:0.0000                        
##  Median :0.00000         Median :0.0000                        
##  Mean   :0.01818         Mean   :0.1026                        
##  3rd Qu.:0.00000         3rd Qu.:0.0000                        
##  Max.   :1.00000         Max.   :1.0000                        
##  function_cleanFinance...Accounting function_cleanHealthcare...Science
##  Min.   :0.00000                    Min.   :0.00000                   
##  1st Qu.:0.00000                    1st Qu.:0.00000                   
##  Median :0.00000                    Median :0.00000                   
##  Mean   :0.02332                    Mean   :0.01969                   
##  3rd Qu.:0.00000                    3rd Qu.:0.00000                   
##  Max.   :1.00000                    Max.   :1.00000                   
##  function_cleanHuman.Resources...Training function_cleanLegal...Compliance
##  Min.   :0.00000                          Min.   :0.000000                
##  1st Qu.:0.00000                          1st Qu.:0.000000                
##  Median :0.00000                          Median :0.000000                
##  Mean   :0.02164                          Mean   :0.008837                
##  3rd Qu.:0.00000                          3rd Qu.:0.000000                
##  Max.   :1.00000                          Max.   :1.000000                
##  function_cleanManagement...Leadership function_cleanMarketing...Advertising
##  Min.   :0.00000                       Min.   :0.0000                       
##  1st Qu.:0.00000                       1st Qu.:0.0000                       
##  Median :0.00000                       Median :0.0000                       
##  Mean   :0.05934                       Mean   :0.0557                       
##  3rd Qu.:0.00000                       3rd Qu.:0.0000                       
##  Max.   :1.00000                       Max.   :1.0000                       
##  function_cleanNA function_cleanOther function_cleanResearch
##  Min.   :0.000    Min.   :0.00000     Min.   :0.000000      
##  1st Qu.:0.000    1st Qu.:0.00000     1st Qu.:0.000000      
##  Median :0.000    Median :0.00000     Median :0.000000      
##  Mean   :0.361    Mean   :0.01818     Mean   :0.002796      
##  3rd Qu.:1.000    3rd Qu.:0.00000     3rd Qu.:0.000000      
##  Max.   :1.000    Max.   :1.00000     Max.   :1.000000      
##  function_cleanSales...Customer.Service...IT
##  Min.   :0.0000                             
##  1st Qu.:0.0000                             
##  Median :0.0000                             
##  Mean   :0.2487                             
##  3rd Qu.:0.0000                             
##  Max.   :1.0000                             
##  function_cleanSupply.Chain...Logistics loc_country_newAT  loc_country_newAU
##  Min.   :0.000000                       Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000                       1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000                       Median :0.000000   Median :0.00000  
##  Mean   :0.004195                       Mean   :0.000783   Mean   :0.01197  
##  3rd Qu.:0.000000                       3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000                       Max.   :1.000000   Max.   :1.00000  
##  loc_country_newBE  loc_country_newBG   loc_country_newBR  loc_country_newCA
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.0000000   Median :0.000000   Median :0.00000  
##  Mean   :0.006544   Mean   :0.0009508   Mean   :0.002013   Mean   :0.02556  
##  3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.0000000   Max.   :1.000000   Max.   :1.00000  
##  loc_country_newCH   loc_country_newCN   loc_country_newCY   loc_country_newDE
##  Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000   Min.   :0.00000  
##  1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.00000  
##  Median :0.0000000   Median :0.0000000   Median :0.0000000   Median :0.00000  
##  Mean   :0.0008389   Mean   :0.0008389   Mean   :0.0006152   Mean   :0.02142  
##  3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.00000  
##  Max.   :1.0000000   Max.   :1.0000000   Max.   :1.0000000   Max.   :1.00000  
##  loc_country_newDK  loc_country_newEE  loc_country_newEG  loc_country_newES 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.000000   Median :0.000000  
##  Mean   :0.002349   Mean   :0.004027   Mean   :0.002908   Mean   :0.003691  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newFI  loc_country_newFR  loc_country_newGB loc_country_newGR
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.0000    Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000    1st Qu.:0.00000  
##  Median :0.000000   Median :0.000000   Median :0.0000    Median :0.00000  
##  Mean   :0.001622   Mean   :0.003915   Mean   :0.1333    Mean   :0.05257  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.0000    3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.0000    Max.   :1.00000  
##  loc_country_newHK  loc_country_newHU  loc_country_newID   loc_country_newIE 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.0000000   Median :0.000000  
##  Mean   :0.004306   Mean   :0.000783   Mean   :0.0007271   Mean   :0.006376  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.0000000   Max.   :1.000000  
##  loc_country_newIL  loc_country_newIN loc_country_newIT  loc_country_newJP 
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.00000   Median :0.000000   Median :0.000000  
##  Mean   :0.004027   Mean   :0.01544   Mean   :0.001734   Mean   :0.001119  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newLT  loc_country_newMT   loc_country_newMU  loc_country_newMX 
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.0000000   Median :0.000000   Median :0.000000  
##  Mean   :0.001286   Mean   :0.0007271   Mean   :0.000783   Mean   :0.001007  
##  3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.0000000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newMY  loc_country_newNA loc_country_newNL  loc_country_newNZ
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.001174   Mean   :0.01935   Mean   :0.007103   Mean   :0.01862  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##  loc_country_newOther loc_country_newPH  loc_country_newPK loc_country_newPL 
##  Min.   :0.000000     Min.   :0.000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000000     1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000000     Median :0.000000   Median :0.00000   Median :0.000000  
##  Mean   :0.009508     Mean   :0.007383   Mean   :0.00151   Mean   :0.004251  
##  3rd Qu.:0.000000     3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.000000     Max.   :1.000000   Max.   :1.00000   Max.   :1.000000  
##  loc_country_newPT  loc_country_newQA  loc_country_newRO  loc_country_newRU 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.000000   Median :0.000000  
##  Mean   :0.001007   Mean   :0.001174   Mean   :0.002573   Mean   :0.001119  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##  loc_country_newSA   loc_country_newSE loc_country_newSG  loc_country_newTR  
##  Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000000  
##  1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000000  
##  Median :0.0000000   Median :0.00000   Median :0.000000   Median :0.0000000  
##  Mean   :0.0008389   Mean   :0.00274   Mean   :0.004474   Mean   :0.0009508  
##  3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000000  
##  Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000000  
##  loc_country_newUA   loc_country_newUS loc_country_newZA 
##  Min.   :0.0000000   Min.   :0.000     Min.   :0.000000  
##  1st Qu.:0.0000000   1st Qu.:0.000     1st Qu.:0.000000  
##  Median :0.0000000   Median :1.000     Median :0.000000  
##  Mean   :0.0007271   Mean   :0.596     Mean   :0.002237  
##  3rd Qu.:0.0000000   3rd Qu.:1.000     3rd Qu.:0.000000  
##  Max.   :1.0000000   Max.   :1.000     Max.   :1.000000

Step 3: Split Data

After cleaning the data, we can split it. We will use a 70-30 split. Splitting matters because it lets us train the models on one portion of the data and then test their efficacy on the remaining, unseen portion. While we typically use a 50-50 split for the first split when building stacked (two-level) models, we use a 70-30 split here because we have almost 18,000 rows of data. Even with a 70-30 split, there will be enough data to effectively train and test the decision tree model that combines the six models created at this point.

# Let's do a 70-30 split.

trainprop <- 0.7 # This is the proportion of data we want in our training data set
set.seed(12345) # Fix the seed so the random split is reproducible
train_rows <- sample(1:nrow(job), trainprop*nrow(job)) # Get the rows for the training data. We can use train_rows for job, job_dummy, and job_scaled, as all three data sets have the same number of rows/observations.

# Train and test data for Logistic Regression, SVM, and Random Forest Models
job_train <- job[train_rows, ] # Store the training data
job_test <- job[-train_rows, ] # Store the testing data

# Train and test data for Decision Tree Model
job_dummy_train <- job_dummy[train_rows, ] # Store the training data
job_dummy_test <- job_dummy[-train_rows, ] # Store the testing data

# Train and test data for KNN and ANN models
job_scaled_train <- job_scaled[train_rows, ] # Store the training data
job_scaled_test <- job_scaled[-train_rows, ] # Store the testing data

# Let's do a quick check that random split worked (using dependent variable)
summary(job_train$fraudulent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04714 0.00000 1.00000
summary(job_test$fraudulent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05145 0.00000 1.00000
summary(job_dummy_train$fraudulent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04714 0.00000 1.00000
summary(job_dummy_test$fraudulent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05145 0.00000 1.00000
summary(job_scaled_train$fraudulent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04714 0.00000 1.00000
summary(job_scaled_test$fraudulent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05145 0.00000 1.00000
# The mean value is similar between the train and test data sets signifying the split was done successfully
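
The random split above leaves the fraud rate slightly different between the training set (about 4.7%) and the test set (about 5.1%). If a closer match were wanted, one option would be a stratified split, which samples 70% within each class separately. The following is a minimal sketch of that idea on a toy data frame; `toy`, `stratified_rows`, and the 5% fraud rate are illustrative assumptions and not part of the original analysis.

```r
# Hypothetical toy data standing in for `job`: 1,000 rows, 5% fraudulent
set.seed(12345)
toy <- data.frame(id = 1:1000,
                  fraudulent = rep(c(0, 1), times = c(950, 50)))

# Sample a proportion `prop` of the row indices within each class of `y`
stratified_rows <- function(y, prop) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

toy_train_rows <- stratified_rows(toy$fraudulent, 0.7)
toy_train <- toy[toy_train_rows, ]
toy_test  <- toy[-toy_train_rows, ]

# Both sets keep the overall fraud rate of 0.05
mean(toy_train$fraudulent)
mean(toy_test$fraudulent)
```

With almost 18,000 rows, the simple random split is likely close enough; stratifying mainly helps when the positive class is very small.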

Step 4 & 5: Build a Model + Predict

Now that the data has been split, it is time to create the various models. We will load any necessary libraries, then create the models based on the training data (job_train, job_dummy_train, and job_scaled_train). Once the models are trained and evaluated, they can be used to predict whether future job postings are fraudulent. We will also improve/optimize these models, using different levers depending on the model.

Logistic Regression Model

For the logistic regression model, we can improve the model by adding interaction terms between predictors and changing the lr_pred_cutoff value. We want to extract the predicted probabilities (lr_pred) for the stacked model, so that we have the most “raw” version of the predictions/results.

# Build Model

# Since we are trying to predict fraudulent, will have that be our response variable. Since we are using all other columns to predict fraudulent, those will be our predictor variables. 

# Let's add some other combinations of predictors to increase the model's accuracy and sensitivity
# lr_model <- glm(fraudulent ~ . + industry_clean * function_clean
                               #+ description * benefits
                               #+ description * requirements 
                               #+ required_experience * required_education
                               #+ employment_type * required_experience
                               #+ employment_type * required_education
                               #+ required_experience * has_years_experience
                                #, data = job_train, family = "binomial")
# saveRDS(lr_model, "lrJobModel.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

lr_model <- readRDS("lrJobModel.RDS")

# Predictor - Accuracy | Sensitivity
  # Base - 0.9627 | 0.42029
  # industry_clean * function_clean - 0.9623 | 0.44928
  # description * benefits - 0.9625 | 0.45290
  # description * requirements - 0.962 | 0.45652
  # required_experience * required_education - 0.9631 | 0.49275
  # employment_type * required_experience - 0.9618 | 0.49638
  # employment_type * required_education - 0.9532 | 0.52899
  # required_experience * has_years_experience - 0.9629 | 0.52899

# The following combinations of predictors decreased the model's sensitivity and/or accuracy
  #+ required_experience * has_no_experience_needed
  #+ benefits * has_benefits_stated
  #+ benefits * has_referral_bonus
  #+ benefits * has_signing_bonus

# 2nd Logistic Regression Model - using step function to optimize through all combinations of predictors (.*.)
# 2nd model was attempted, however due to number of variables/columns, it took too long

# m1 <- glm(fraudulent ~ . + .*., data = job_train, family = "binomial")
# saveRDS(m1, "LRJobModel_m1.RDS")
# lr_model_2 <- step(m1, direction = "backward") 
# saveRDS(lr_model_2, "LRJobModel_2.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

# lr_model_2 <- readRDS("LRJobModel_2.RDS")

# Predict

# standard model
lr_pred <- predict(lr_model, job_test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
lr_pred_cutoff <- 0.5

lr_bin_pred <- ifelse(lr_pred >= lr_pred_cutoff, 1, 0)
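
To illustrate how the cutoff value trades accuracy against sensitivity, here is a minimal sketch on made-up numbers. `toy_prob`, `toy_actual`, and `cutoff_metrics` are hypothetical names introduced only for this example; they are not the job data or the project's evaluation code.

```r
# Hypothetical predicted probabilities and true labels (not from the job data)
toy_prob   <- c(0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.80, 0.90)
toy_actual <- c(0,    0,    0,    1,    0,    1,    1,    1)

# Accuracy and sensitivity of thresholded predictions at a given cutoff
cutoff_metrics <- function(prob, actual, cutoff) {
  pred <- ifelse(prob >= cutoff, 1, 0)
  tp <- sum(pred == 1 & actual == 1) # fraudulent posts that were flagged
  fn <- sum(pred == 0 & actual == 1) # fraudulent posts that were missed
  c(cutoff      = cutoff,
    accuracy    = mean(pred == actual),
    sensitivity = tp / (tp + fn))
}

# Compare a few candidate cutoffs side by side (one row per cutoff)
toy_results <- t(sapply(c(0.3, 0.5, 0.7), cutoff_metrics,
                        prob = toy_prob, actual = toy_actual))
toy_results
```

In this toy case, lowering the cutoff from 0.7 to 0.3 raises sensitivity from 0.50 to 1.00, which is the same lever lr_pred_cutoff provides.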

Decision Tree Model

For the decision tree model, we can improve the model by using a cost_matrix to change the relative weighting of false positives and false negatives that the model is trying to balance/optimize. We will also build an unweighted decision tree to use for the stacked model, as we want the results (dt_pred) without having placed our thumb on the scale.

library(C50) # We need this library to run a decision tree model

# Build Model (without weights)
# dt_model <- C5.0(as.factor(fraudulent) ~ ., data = job_dummy_train)
# saveRDS(dt_model, "dtJobModel.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

dt_model <- readRDS("dtJobModel.RDS")
plot(dt_model)

# Predict (without weights)
dt_pred <- predict(dt_model, job_dummy_test)

# Build Model (with weights)
cost_matrix <- matrix(c(0, 1, 6, 0), nrow = 2) 
cost_matrix # Check the matrix looks correct 
##      [,1] [,2]
## [1,]    0    6
## [2,]    1    0
# dt_cost_model <- C5.0(as.factor(fraudulent) ~ ., data = job_dummy_train, costs = cost_matrix)
# saveRDS(dt_cost_model, "dtJobModel_2.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

dt_cost_model <- readRDS("dtJobModel_2.RDS")
plot(dt_cost_model)

# Predict (with weights)
dt_weights_pred <- predict(dt_cost_model, job_dummy_test)
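
The cost matrix is easier to read with labels. The sketch below rebuilds the same 2 x 2 matrix with dimnames and computes the total cost of a made-up confusion table. It assumes that rows are predicted classes and columns are actual classes, which is our understanding of the C50 convention; `toy_costs`, `toy_confusion`, and the counts are hypothetical.

```r
# Same costs as cost_matrix above, with labels added for readability.
# Assumption: rows = predicted class, columns = actual class.
toy_costs <- matrix(c(0, 1, 6, 0), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual    = c("0", "1")))

# Hypothetical confusion counts laid out the same way (not real model results)
toy_confusion <- matrix(c(900, 40, 20, 40), nrow = 2,
                        dimnames = dimnames(toy_costs))

# Cost contributed by each error type
fn_cost <- toy_costs["0", "1"] * toy_confusion["0", "1"] # missed fraud: 6 * 20
fp_cost <- toy_costs["1", "0"] * toy_confusion["1", "0"] # false alarms: 1 * 40

# Total misclassification cost
total_cost <- sum(toy_costs * toy_confusion)
total_cost
```

Under this reading, 20 missed fraudulent posts cost three times as much as 40 false alarms, which is why the weighted tree leans toward flagging borderline posts.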

SVM Model

For the SVM model, we can improve the model by changing the kernel. Finding the best kernel is a matter of trial and error: we fit each kernel and compare the results. We can also improve the model by changing the SVM_pred_cutoff value. We want to extract the continuous predicted values (SVM_pred), rather than thresholded 0/1 labels, for the stacked model, so that we have the most “raw” version of the predictions/results.

library(kernlab) # We need this library to run a SVM model

# Build Model
# SVM_model <- ksvm(fraudulent ~ ., data = job_train, kernel = "rbfdot")
# SVM_model_2 <- ksvm(fraudulent ~ ., data = job_train, kernel = "tanhdot")
# saveRDS(SVM_model, "SVMJobModel.RDS")
# saveRDS(SVM_model_2, "SVMJobModel_2.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

SVM_model <- readRDS("SVMJobModel.RDS")
SVM_model_2 <- readRDS("SVMJobModel_2.RDS")

# Kernel - Sensitivity | Accuracy
  # rbfdot - 0.23551 | 0.9605
  # polydot - 0.184783 | 0.9556
  # vanilladot - 0.184783 | 0.9556
  # tanhdot - 0.55072 | 0.5021
  # laplacedot - 0.159420 | 0.9567
  # besseldot - 0.29348 | 0.4746
  # anovadot - 1.00000 | 0.0515
  # splinedot - NA | NA <- no results after letting run for 30 minutes

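# From testing all kernels, we can see that rbfdot had the highest accuracy and the 4th highest sensitivity. Among the usable kernels, tanhdot produced the highest sensitivity (at significantly lower accuracy). anovadot reached a sensitivity of 1.00 only with an accuracy of 0.0515, which matches the share of fraudulent posts in the test set, meaning it flagged essentially every post. As a result, we decided to run the rbfdot and tanhdot models to see whether favoring accuracy or sensitivity gives the better result.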

# Predict
SVM_pred <- predict(SVM_model, job_test)
SVM_pred_2 <- predict(SVM_model_2, job_test)

SVM_pred_cutoff <- 0.1 
# We reduced the cutoff value to 0.1 to reduce the number of false negatives and increase the accuracy and sensitivity of the model.
SVM_pred_2_cutoff <- 0.3
# We reduced the cutoff value to 0.3 to reduce the number of false negatives and increase the accuracy and sensitivity of the model.

SVM_bin_prob <- ifelse(SVM_pred >= SVM_pred_cutoff, 1, 0)
SVM_bin_prob_2 <- ifelse(SVM_pred_2 >= SVM_pred_2_cutoff, 1, 0)

Random Forest

For the random forest model, we can improve the model by modifying the ntree and nodesize values.

library(randomForest) # We need this library to run a random forest model
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Build Model
# rf_model <- randomForest(as.factor(fraudulent) ~ ., data = job_train, ntree = 2000, nodesize = 5)
# saveRDS(rf_model, "rfJobModel.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

rf_model <- readRDS("rfJobModel.RDS")

varImpPlot(rf_model) # From this plot, we can see that function, company_profile, and description are the biggest predictors of being fraudulent

# Predict
rf_pred <- predict(rf_model, job_test)

KNN Model

For the KNN model, we can improve the model by modifying the k value and changing the KNN_pred_cutoff value. We want to extract the probabilities (KNN_prob) for the stacked model, so that we have the most “raw” version of the predictions/results.

library(class)

# Identify predictor columns (all except the target)
predictor_cols <- colnames(job_scaled_train)[colnames(job_scaled_train) != "fraudulent"]

# Train and test predictor matrices
train_X <- job_scaled_train[, predictor_cols]
test_X  <- job_scaled_test[, predictor_cols]

# Target vector
train_y <- job_scaled_train$fraudulent

# Run KNN
# KNN_pred <- knn(train = train_X, test = test_X, cl = train_y, k = 4, prob = TRUE) # We optimized k over the range [2, 100] for accuracy and sensitivity
# saveRDS(KNN_pred, "KNNJobModel.RDS")

# k = # | Accuracy | Sensitivity
# k = 100 | 0.9485 | 0.000000
# k = 75 | 0.9485 | 0.000000
# k = 50 | 0.9508 | 0.043478
# k = 40 | 0.9534 | 0.105072
# k = 30 | 0.9603 | 0.25725
# k = 20 | 0.9681 | 0.43478
# k = 10 | 0.9724 | 0.57971
# k = 5 | 0.9754 | 0.65580
# k = 4 | 0.9735 | 0.71377 # highest sensitivity among k >= 3
# k = 3 | 0.9767 | 0.70652 # highest accuracy
# k = 2 | 0.9674 | 0.81522

# We choose k = 4 since k = 4 and k = 3 have similar results. However, the increase in sensitivity from going from 3 to 4 is larger than the decrease in accuracy from 3 to 4.

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

KNN_pred <- readRDS("KNNJobModel.RDS")

# Convert KNN "prob" attribute to numeric probabilities
KNN_prob <- ifelse(KNN_pred == "1",
                   attr(KNN_pred, "prob"),
                   1 - attr(KNN_pred, "prob"))

# Apply cutoff
KNN_pred_cutoff <- 0.3
# We reduced the cutoff value to 0.3 to reduce the number of false negatives and increase the sensitivity of the model.
KNN_bin_prob <- ifelse(KNN_prob >= KNN_pred_cutoff, 1, 0)
#library(class) # We need this library to run a KNN model

# Build Model + Predict
#KNN_pred <- knn(train = job_scaled_train[, -100],
#                  test = job_scaled_test[, -100],
#                  cl = job_scaled_train[, -100],
 #                 k = 15, prob = TRUE) 

#KNN_prob <- ifelse(KNN_pred == "1", attr(KNN_pred, "prob"), 1 - attr(KNN_pred, "prob")) #to get most raw data

#KNN_pred_cutoff <- 0.5

#KNN_bin_prob <- ifelse(KNN_prob >= KNN_pred_cutoff, 1, 0)
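
The conversion from the KNN "prob" attribute to a fraud probability can be checked on a tiny example. In class::knn, the "prob" attribute holds the share of votes for the winning class, so it has to be flipped whenever the winning class is "0". The sketch below is only an illustration: toy_pred is a hypothetical stand-in for a knn() result, not output from the job data.

```r
# Hypothetical stand-in for the object returned by class::knn(..., prob = TRUE):
# a factor of winning classes with the winning class's vote share attached.
toy_pred <- factor(c("0", "1", "0", "1"), levels = c("0", "1"))
attr(toy_pred, "prob") <- c(0.75, 0.75, 1.00, 0.50)

# Vote share of the winning class for each test row
win_share <- attr(toy_pred, "prob")

# Probability of class "1" (fraudulent): keep the share when "1" won, flip it otherwise
toy_prob <- ifelse(toy_pred == "1", win_share, 1 - win_share)
toy_prob

# Apply the same 0.3 cutoff used for KNN_bin_prob
toy_bin <- ifelse(toy_prob >= 0.3, 1, 0)
toy_bin
```

The first row shows why the flip matters: a 0.75 vote share for class "0" is a fraud probability of only 0.25, which stays below the 0.3 cutoff.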

ANN Model

For the ANN model, we can improve the model by changing the number of nodes (e.g., hidden = c(5, 3, 2)) and changing the ANN_pred_cutoff value. We want to extract the fractional values (ANN_pred) for the stacked model, so that we have the most “raw” version of the predictions/results.

library(neuralnet) # We need this library to run an ANN model
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
# Build Model
set.seed(12345) # Let's make the randomization "not so random"
#ANN_model_1 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8)
#ANN_model_2 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8, hidden = c(3, 2))
#ANN_model_3 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8, hidden = c(5, 3, 2))
#ANN_model_4 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8, hidden = c(5, 3, 3, 2))
#saveRDS(ANN_model_1, "ANNJobModel_1.RDS")
#saveRDS(ANN_model_2, "ANNJobModel_2.RDS")
#saveRDS(ANN_model_3, "ANNJobModel_3.RDS")
#saveRDS(ANN_model_4, "ANNJobModel_4.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

ANN_model_1 <- readRDS("ANNJobModel_1.RDS")
ANN_model_2 <- readRDS("ANNJobModel_2.RDS")
ANN_model_3 <- readRDS("ANNJobModel_3.RDS")
ANN_model_4 <- readRDS("ANNJobModel_4.RDS")

plot(ANN_model_1, rep = "best")

plot(ANN_model_2, rep = "best")

plot(ANN_model_3, rep = "best")

plot(ANN_model_4, rep = "best")

# Predict
ANN_pred <- predict(ANN_model_1, job_scaled_test)
ANN_pred_cutoff <- 0.3 # We reduced the cutoff value to 0.3 to reduce the number of false negatives and increase the accuracy and sensitivity of the model.
ANN_bin_pred <- ifelse(ANN_pred >= ANN_pred_cutoff, 1, 0)

ANN_pred_2 <- predict(ANN_model_2, job_scaled_test)
ANN_pred_cutoff_2 <- 0.3 # We reduced the cutoff value to 0.3 to increase the sensitivity of the model.
ANN_bin_pred_2 <- ifelse(ANN_pred_2 >= ANN_pred_cutoff_2, 1, 0)

ANN_pred_3 <- predict(ANN_model_3, job_scaled_test)
ANN_pred_cutoff_3 <- 0.2 # We reduced the cutoff value to 0.2 to reduce the number of false negatives and increase the sensitivity of the model.
ANN_bin_pred_3 <- ifelse(ANN_pred_3 >= ANN_pred_cutoff_3, 1, 0)

ANN_pred_4 <- predict(ANN_model_4, job_scaled_test)
ANN_pred_cutoff_4 <- 0.2 # We reduced the cutoff value to 0.2 to reduce the number of false negatives and increase the sensitivity of the model.
ANN_bin_pred_4 <- ifelse(ANN_pred_4 >= ANN_pred_cutoff_4, 1, 0)
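
The cutoffs above were chosen by trying values and reading the resulting confusion matrices. A minimal sketch of that search is shown below, assuming a small set of made-up scores and labels (toy_score and toy_actual are hypothetical and not taken from job_scaled_test). The helper cutoff_metrics is also a name introduced here for illustration only.

```r
# Toy scores and labels (hypothetical; not taken from job_scaled_test)
toy_actual <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
toy_score  <- c(0.05, 0.10, 0.15, 0.25, 0.35, 0.60, 0.15, 0.25, 0.45, 0.90)

# Sensitivity, accuracy, and false-negative count at a single cutoff
cutoff_metrics <- function(score, actual, cutoff) {
  pred <- ifelse(score >= cutoff, 1, 0)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    accuracy    = mean(pred == actual),
    false_neg   = sum(pred == 0 & actual == 1))
}

# Evaluate several candidate cutoffs and collect one row per cutoff
cutoffs <- c(0.5, 0.3, 0.2)
sweep <- as.data.frame(t(sapply(cutoffs, function(k) {
  cutoff_metrics(toy_score, toy_actual, k)
})))
sweep
```

In this toy data, lowering the cutoff removes false negatives and raises sensitivity while accuracy stays flat, because each fraud caught is offset by a legitimate post that gets flagged. On real predictions the accuracy can move in either direction, so it is worth checking both columns.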

Step 5.5: Combine, Split, Build, & Predict (Stacked Model)

We will now take the prediction results of the various models and create a new data set to build the stacked model from. Similar to the normal workflow, we will split the data, then use it to build the second-level decision tree model. Finally, we will use the resulting decision tree model to predict the test data set. We will also use a cost matrix to optimize the model.

# Combine the predictions of the 7 individual models into a new data frame
stacked_data <- data.frame(
      lr_pred = c(lr_pred),
      dt_pred = c(dt_pred), 
      SVM_pred = c(SVM_pred),
      SVM_pred_2 = c(SVM_pred_2),
      rf_pred = c(rf_pred),
      KNN_pred = c(KNN_prob), 
      ANN_pred = c(ANN_pred_3),
      actual = c(job_test$fraudulent)
    )

# Split the data in to train and test data

# Let's do a 50-50 split, again (since there is such a small amount of data). We want there to be a somewhat decent amount of test data
trainprop <- 0.5 # This is the proportion of data we want in our training data set
set.seed(12345) # Let's make the randomization "not so random"
stacked_train_rows <- sample(1:nrow(stacked_data), trainprop*nrow(stacked_data)) # Get the rows for the training data by sampling half of the row indices of stacked_data

# Train and test data for the stacked model
stacked_train <- stacked_data[stacked_train_rows, ] # Store the training data
stacked_test <- stacked_data[-stacked_train_rows, ] # Store the testing data

# Let's do a quick check that random split worked (using dependent variable)
summary(stacked_train$actual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05295 0.00000 1.00000
summary(stacked_test$actual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04996 0.00000 1.00000
# The mean value is similar between the train and test data sets signifying the split was done successfully

# Build a decision tree as the second-level model stacked on top of the seven individual models, then predict with it

# Build Model (without weights)
# stacked_unweighted_model <- C5.0(as.factor(actual) ~ ., data = stacked_train)
# saveRDS(stacked_unweighted_model, "stackedJobModel.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

stacked_unweighted_model <- readRDS("stackedJobModel.RDS")
plot(stacked_unweighted_model)

# Build Model (with weights)
stacked_cost_matrix <- matrix(c(0, 1, 5, 0), nrow = 2) 

stacked_cost_matrix # Check the matrix looks correct 
##      [,1] [,2]
## [1,]    0    5
## [2,]    1    0
# stacked_model <- C5.0(as.factor(actual) ~ ., data = stacked_train, costs = stacked_cost_matrix)
# saveRDS(stacked_model, "stackedJobModel_2.RDS")

# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file

stacked_model <- readRDS("stackedJobModel_2.RDS")
plot(stacked_model)

# Predict (without weights)
stacked_unweighted_pred <- predict(stacked_unweighted_model, stacked_test)

# Predict (with weights)
stacked_pred <- predict(stacked_model, stacked_test)
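
The combine, split, build, and predict flow above can be sketched end to end on simulated data. This is only an illustration under stated assumptions: the level-one scores (model_a, model_b) are simulated rather than taken from the job models, and a logistic regression (glm) stands in for the C5.0 tree so that the sketch needs nothing beyond base R.

```r
set.seed(12345)  # reproducible toy data

# Hypothetical level-one outputs for 400 held-out posts (not the job data)
n <- 400
actual <- rbinom(n, 1, 0.2)
toy_stack <- data.frame(
  model_a = plogis(-2 + 3 * actual + rnorm(n)),          # informative score
  model_b = plogis(-2 + 2 * actual + rnorm(n, sd = 2)),  # noisier score
  actual  = actual
)

# 50-50 split of the level-one predictions, mirroring the split above
train_rows <- sample(seq_len(n), size = floor(0.5 * n))
toy_train  <- toy_stack[train_rows, ]
toy_test   <- toy_stack[-train_rows, ]

# Second-level model: logistic regression standing in for the C5.0 tree
meta_model <- glm(actual ~ model_a + model_b, data = toy_train, family = binomial)
meta_prob  <- predict(meta_model, newdata = toy_test, type = "response")
meta_pred  <- ifelse(meta_prob >= 0.5, 1, 0)

# Confusion table for the held-out half
table(Prediction = meta_pred, Reference = toy_test$actual)
```

The key design point is that the second-level model only ever sees predictions made on data the level-one models were not trained on, and it is itself scored on rows it did not see during fitting.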

Step 6: Evaluate Model

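Now that the models are created, we can evaluate them by creating confusion matrices. It will be important to look at the accuracy and sensitivity of each model. We also want to make sure that we are minimizing false negatives, as those are much more costly than false positives. A false negative is a fraudulent posting that the model does not flag, so it stays live and applicants risk wasting their time or having their information stolen. A false positive is a legitimate posting that the model flags as fraudulent, which costs the employer an unflagging request but does not put applicants at risk.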

# Let's build some confusion matrices

library(caret) # We need this library to build a confusion matrix

library(knitr) # Load in library so that the table is formatted in an easy to read manner

Logistic Regression Model

cm_lr <- confusionMatrix(as.factor(lr_bin_pred), as.factor(job_test$fraudulent), positive = "1")
cm_lr
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5019  130
##          1   69  146
##                                           
##                Accuracy : 0.9629          
##                  95% CI : (0.9575, 0.9678)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : 3.644e-07       
##                                           
##                   Kappa : 0.5756          
##                                           
##  Mcnemar's Test P-Value : 2.107e-05       
##                                           
##             Sensitivity : 0.52899         
##             Specificity : 0.98644         
##          Pos Pred Value : 0.67907         
##          Neg Pred Value : 0.97475         
##              Prevalence : 0.05145         
##          Detection Rate : 0.02722         
##    Detection Prevalence : 0.04008         
##       Balanced Accuracy : 0.75771         
##                                           
##        'Positive' Class : 1               
## 
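
To make the link between the four cells and the reported statistics concrete, the headline numbers for the logistic regression model can be recomputed by hand. The counts below are copied from the confusion matrix above; the variable names are introduced here for illustration.

```r
# Counts copied from the logistic regression confusion matrix above
tn <- 5019  # predicted 0, actually 0
fn <- 130   # predicted 0, actually 1 (missed fraud)
fp <- 69    # predicted 1, actually 0 (false alarm)
tp <- 146   # predicted 1, actually 1

sensitivity    <- tp / (tp + fn)                   # share of fraudulent posts caught
specificity    <- tn / (tn + fp)                   # share of legitimate posts left alone
accuracy       <- (tp + tn) / (tp + tn + fp + fn)  # share of all posts classified correctly
pos_pred_value <- tp / (tp + fp)                   # share of flagged posts that are fraudulent

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, pos_pred_value = pos_pred_value), 4)
```

With only about 5% of posts being fraudulent, accuracy is dominated by the true negatives, which is why a model can reach 96% accuracy while still missing almost half of the fraudulent posts.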

Decision Tree Model

# Decision Tree Model (without weights)
cm_unweighted_dt <- confusionMatrix(as.factor(dt_pred), as.factor(job_test$fraudulent), positive = "1")
# Looking at the confusion matrix, we need to apply a cost matrix. In this situation, false negatives are extremely costly, so we want to weight them appropriately. I will apply a cost matrix that costs false negatives at a 6:1 ratio to false positives to reduce the number of false negatives. This will increase the number of false positives, but we are less concerned about that, as it is less costly to deal with job posts that are falsely flagged than with fraudulent posts that are missed.
cm_unweighted_dt
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5038   96
##          1   50  180
##                                          
##                Accuracy : 0.9728         
##                  95% CI : (0.9681, 0.977)
##     No Information Rate : 0.9485         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6973         
##                                          
##  Mcnemar's Test P-Value : 0.0001959      
##                                          
##             Sensitivity : 0.65217        
##             Specificity : 0.99017        
##          Pos Pred Value : 0.78261        
##          Neg Pred Value : 0.98130        
##              Prevalence : 0.05145        
##          Detection Rate : 0.03356        
##    Detection Prevalence : 0.04288        
##       Balanced Accuracy : 0.82117        
##                                          
##        'Positive' Class : 1              
## 
# Decision Tree Model (with weights)
cm_dt <- confusionMatrix(as.factor(dt_weights_pred), as.factor(job_test$fraudulent), positive = "1")
cm_dt
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4937   47
##          1  151  229
##                                          
##                Accuracy : 0.9631         
##                  95% CI : (0.9577, 0.968)
##     No Information Rate : 0.9485         
##     P-Value [Acc > NIR] : 2.557e-07      
##                                          
##                   Kappa : 0.679          
##                                          
##  Mcnemar's Test P-Value : 2.482e-13      
##                                          
##             Sensitivity : 0.82971        
##             Specificity : 0.97032        
##          Pos Pred Value : 0.60263        
##          Neg Pred Value : 0.99057        
##              Prevalence : 0.05145        
##          Detection Rate : 0.04269        
##    Detection Prevalence : 0.07084        
##       Balanced Accuracy : 0.90002        
##                                          
##        'Positive' Class : 1              
## 

SVM Models

cm_SVM <- confusionMatrix(as.factor(SVM_bin_prob), as.factor(job_test$fraudulent), positive = "1")
cm_SVM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5052  126
##          1   36  150
##                                           
##                Accuracy : 0.9698          
##                  95% CI : (0.9649, 0.9742)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : 1.968e-14       
##                                           
##                   Kappa : 0.6342          
##                                           
##  Mcnemar's Test P-Value : 2.700e-12       
##                                           
##             Sensitivity : 0.54348         
##             Specificity : 0.99292         
##          Pos Pred Value : 0.80645         
##          Neg Pred Value : 0.97567         
##              Prevalence : 0.05145         
##          Detection Rate : 0.02796         
##    Detection Prevalence : 0.03468         
##       Balanced Accuracy : 0.76820         
##                                           
##        'Positive' Class : 1               
## 
cm_SVM_2 <- confusionMatrix(as.factor(SVM_bin_prob_2), as.factor(job_test$fraudulent), positive = "1")
cm_SVM_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2537  124
##          1 2551  152
##                                           
##                Accuracy : 0.5013          
##                  95% CI : (0.4878, 0.5148)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0096          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.55072         
##             Specificity : 0.49862         
##          Pos Pred Value : 0.05623         
##          Neg Pred Value : 0.95340         
##              Prevalence : 0.05145         
##          Detection Rate : 0.02834         
##    Detection Prevalence : 0.50391         
##       Balanced Accuracy : 0.52467         
##                                           
##        'Positive' Class : 1               
## 

Random Forest

cm_rf <- confusionMatrix(as.factor(rf_pred), as.factor(job_test$fraudulent), positive = "1")
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5085   93
##          1    3  183
##                                           
##                Accuracy : 0.9821          
##                  95% CI : (0.9782, 0.9855)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7832          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.66304         
##             Specificity : 0.99941         
##          Pos Pred Value : 0.98387         
##          Neg Pred Value : 0.98204         
##              Prevalence : 0.05145         
##          Detection Rate : 0.03412         
##    Detection Prevalence : 0.03468         
##       Balanced Accuracy : 0.83123         
##                                           
##        'Positive' Class : 1               
## 
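
Kappa is reported for every model but not explained, so here is a minimal sketch of how it can be derived from the random forest table above. It compares the observed agreement with the agreement expected by chance from the row and column totals. The object names are introduced here for illustration.

```r
# Random forest confusion table copied from the output above
rf_table <- matrix(c(5085, 3, 93, 183), nrow = 2,
                   dimnames = list(Prediction = c("0", "1"),
                                   Reference  = c("0", "1")))

n        <- sum(rf_table)
observed <- sum(diag(rf_table)) / n                           # observed agreement (accuracy)
expected <- sum(rowSums(rf_table) * colSums(rf_table)) / n^2  # agreement expected by chance
kappa_rf <- (observed - expected) / (1 - expected)

round(c(observed = observed, expected = expected, kappa = kappa_rf), 4)
```

Because roughly 95% of posts are legitimate, chance agreement is already above 0.9, so kappa is a more demanding yardstick than accuracy on this data.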

KNN Model

cm_KNN <- confusionMatrix(as.factor(KNN_bin_prob), as.factor(job_scaled_test$fraudulent), positive = "1") # Reference the target by name rather than by column position
cm_KNN
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5016   75
##          1   72  201
##                                           
##                Accuracy : 0.9726          
##                  95% CI : (0.9679, 0.9768)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7178          
##                                           
##  Mcnemar's Test P-Value : 0.869           
##                                           
##             Sensitivity : 0.72826         
##             Specificity : 0.98585         
##          Pos Pred Value : 0.73626         
##          Neg Pred Value : 0.98527         
##              Prevalence : 0.05145         
##          Detection Rate : 0.03747         
##    Detection Prevalence : 0.05089         
##       Balanced Accuracy : 0.85705         
##                                           
##        'Positive' Class : 1               
## 

ANN Model

cm_ANN <- confusionMatrix(as.factor(ANN_bin_pred), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5015   88
##          1   73  188
##                                           
##                Accuracy : 0.97            
##                  95% CI : (0.9651, 0.9744)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : 1.12e-14        
##                                           
##                   Kappa : 0.6844          
##                                           
##  Mcnemar's Test P-Value : 0.2699          
##                                           
##             Sensitivity : 0.68116         
##             Specificity : 0.98565         
##          Pos Pred Value : 0.72031         
##          Neg Pred Value : 0.98276         
##              Prevalence : 0.05145         
##          Detection Rate : 0.03505         
##    Detection Prevalence : 0.04866         
##       Balanced Accuracy : 0.83341         
##                                           
##        'Positive' Class : 1               
## 
cm_ANN_2 <- confusionMatrix(as.factor(ANN_bin_pred_2), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4970   90
##          1  118  186
##                                           
##                Accuracy : 0.9612          
##                  95% CI : (0.9557, 0.9662)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : 7.054e-06       
##                                           
##                   Kappa : 0.6209          
##                                           
##  Mcnemar's Test P-Value : 0.06119         
##                                           
##             Sensitivity : 0.67391         
##             Specificity : 0.97681         
##          Pos Pred Value : 0.61184         
##          Neg Pred Value : 0.98221         
##              Prevalence : 0.05145         
##          Detection Rate : 0.03468         
##    Detection Prevalence : 0.05667         
##       Balanced Accuracy : 0.82536         
##                                           
##        'Positive' Class : 1               
## 
cm_ANN_3 <- confusionMatrix(as.factor(ANN_bin_pred_3), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4976   79
##          1  112  197
##                                           
##                Accuracy : 0.9644          
##                  95% CI : (0.9591, 0.9692)
##     No Information Rate : 0.9485          
##     P-Value [Acc > NIR] : 1.855e-08       
##                                           
##                   Kappa : 0.6547          
##                                           
##  Mcnemar's Test P-Value : 0.02059         
##                                           
##             Sensitivity : 0.71377         
##             Specificity : 0.97799         
##          Pos Pred Value : 0.63754         
##          Neg Pred Value : 0.98437         
##              Prevalence : 0.05145         
##          Detection Rate : 0.03673         
##    Detection Prevalence : 0.05761         
##       Balanced Accuracy : 0.84588         
##                                           
##        'Positive' Class : 1               
## 
cm_ANN_4 <- confusionMatrix(as.factor(ANN_bin_pred_4), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN_4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4955   82
##          1  133  194
##                                          
##                Accuracy : 0.9599         
##                  95% CI : (0.9543, 0.965)
##     No Information Rate : 0.9485         
##     P-Value [Acc > NIR] : 5.39e-05       
##                                          
##                   Kappa : 0.6224         
##                                          
##  Mcnemar's Test P-Value : 0.0006497      
##                                          
##             Sensitivity : 0.70290        
##             Specificity : 0.97386        
##          Pos Pred Value : 0.59327        
##          Neg Pred Value : 0.98372        
##              Prevalence : 0.05145        
##          Detection Rate : 0.03617        
##    Detection Prevalence : 0.06096        
##       Balanced Accuracy : 0.83838        
##                                          
##        'Positive' Class : 1              
## 

ANN Model Comparison

ANN_Comparison <- data.frame(
  "Model" = c("ANN_1", "ANN_2", "ANN_3", "ANN_4"),
  "Accuracy" = c(round(cm_ANN$overall["Accuracy"], 4), round(cm_ANN_2$overall["Accuracy"], 4), round(cm_ANN_3$overall["Accuracy"], 4), round(cm_ANN_4$overall["Accuracy"], 4)),
  "Sensitivity" = c(round(cm_ANN$byClass["Sensitivity"], 4), round(cm_ANN_2$byClass["Sensitivity"], 4), round(cm_ANN_3$byClass["Sensitivity"], 4), round(cm_ANN_4$byClass["Sensitivity"], 4)),
  "Kappa" = c(round(cm_ANN$overall["Kappa"], 4), round(cm_ANN_2$overall["Kappa"], 4), round(cm_ANN_3$overall["Kappa"], 4), round(cm_ANN_4$overall["Kappa"], 4)), # Round every kappa to 4 decimal places
  "P-Value" = c(round(cm_ANN$overall["AccuracyPValue"], 4), round(cm_ANN_2$overall["AccuracyPValue"], 4), round(cm_ANN_3$overall["AccuracyPValue"], 4), round(cm_ANN_4$overall["AccuracyPValue"], 4))
)

kable(ANN_Comparison, format = "markdown")
| Model | Accuracy | Sensitivity | Kappa | P.Value |
|:------|---------:|------------:|------:|--------:|
| ANN_1 | 0.9700 | 0.6812 | 0.6844 | 0e+00 |
| ANN_2 | 0.9612 | 0.6739 | 0.6209 | 0e+00 |
| ANN_3 | 0.9644 | 0.7138 | 0.6547 | 0e+00 |
| ANN_4 | 0.9599 | 0.7029 | 0.6224 | 1e-04 |

Looking at these 4 models, I will use ANN_3, as it has the highest sensitivity of 0.7138. However, this higher sensitivity does come at a trade-off of lower accuracy (0.9644, compared with 0.9700 for ANN_1). I will use this model for the implementation step, along with feeding it into the stacked model.

Stacked Model

# Raw data values
cm_unweight_stacked <- confusionMatrix(as.factor(stacked_unweighted_pred), as.factor(stacked_test$actual), positive = "1")
# Looking at the confusion matrix, we need to apply a cost matrix. In this situation, false negatives are extremely costly, so we want to weight them appropriately. I will apply a cost matrix that costs false negatives at a 5:1 ratio to false positives to reduce the number of false negatives. This will increase the number of false positives, but we are less concerned about that, as it is less costly to deal with job posts that are falsely flagged than with fraudulent posts that are missed.
cm_unweight_stacked
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2534   30
##          1   14  104
##                                          
##                Accuracy : 0.9836         
##                  95% CI : (0.978, 0.9881)
##     No Information Rate : 0.95           
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8168         
##                                          
##  Mcnemar's Test P-Value : 0.02374        
##                                          
##             Sensitivity : 0.77612        
##             Specificity : 0.99451        
##          Pos Pred Value : 0.88136        
##          Neg Pred Value : 0.98830        
##              Prevalence : 0.04996        
##          Detection Rate : 0.03878        
##    Detection Prevalence : 0.04400        
##       Balanced Accuracy : 0.88531        
##                                          
##        'Positive' Class : 1              
## 
cm_stacked <- confusionMatrix(as.factor(stacked_pred), as.factor(stacked_test$actual), positive = "1")
cm_stacked
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2496   22
##          1   52  112
##                                           
##                Accuracy : 0.9724          
##                  95% CI : (0.9655, 0.9783)
##     No Information Rate : 0.95            
##     P-Value [Acc > NIR] : 5.264e-09       
##                                           
##                   Kappa : 0.7372          
##                                           
##  Mcnemar's Test P-Value : 0.0007485       
##                                           
##             Sensitivity : 0.83582         
##             Specificity : 0.97959         
##          Pos Pred Value : 0.68293         
##          Neg Pred Value : 0.99126         
##              Prevalence : 0.04996         
##          Detection Rate : 0.04176         
##    Detection Prevalence : 0.06115         
##       Balanced Accuracy : 0.90771         
##                                           
##        'Positive' Class : 1               
## 
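
One way to read the effect of the 5:1 cost matrix is to price both stacked confusion matrices in the same cost units. The sketch below copies the two tables from the output above and multiplies them cell by cell with the cost matrix (rows are predictions, columns are actual classes). The helper total_cost is a name introduced here for illustration.

```r
# Cost matrix used for the weighted stacked model: a false negative costs 5, a false positive 1
cost_matrix <- matrix(c(0, 1, 5, 0), nrow = 2,
                      dimnames = list(Prediction = c("0", "1"),
                                      Reference  = c("0", "1")))

# Confusion tables copied from the two stacked-model outputs above
unweighted <- matrix(c(2534, 14, 30, 104), nrow = 2, dimnames = dimnames(cost_matrix))
weighted   <- matrix(c(2496, 52, 22, 112), nrow = 2, dimnames = dimnames(cost_matrix))

# Total misclassification cost: cell counts times cell costs, summed
total_cost <- function(tab, costs) sum(tab * costs)

cost_unweighted <- total_cost(unweighted, cost_matrix)  # 30 * 5 + 14 * 1
cost_weighted   <- total_cost(weighted, cost_matrix)    # 22 * 5 + 52 * 1
c(unweighted = cost_unweighted, weighted = cost_weighted)
```

Under the 5:1 ratio the weighted tree comes out only slightly cheaper (162 versus 164 cost units), because the 8 extra frauds caught are paid for with 38 extra false positives. The gap depends on the ratio chosen, so it is worth repeating the comparison with the dollar costs assumed in Step 7.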

Model Comparison

Model_Comparison <- data.frame(
  "Model" = c("Logistic Regression", "Decision Tree (weights)", "SVM", "SVM_2", "Random Forest", "KNN", "ANN (ANN_3)", "Stacked Model"),
  "Accuracy" = c(round(cm_lr$overall["Accuracy"], 4), 
                 round(cm_dt$overall["Accuracy"], 4), 
                 round(cm_SVM$overall["Accuracy"], 4), 
                 round(cm_SVM_2$overall["Accuracy"], 4), 
                 round(cm_rf$overall["Accuracy"], 4), 
                 round(cm_KNN$overall["Accuracy"], 4), 
                 round(cm_ANN_3$overall["Accuracy"], 4), # ANN_3 is the network selected above
                 round(cm_stacked$overall["Accuracy"], 4)),
  "Sensitivity" = c(round(cm_lr$byClass["Sensitivity"], 4), 
                    round(cm_dt$byClass["Sensitivity"], 4), 
                    round(cm_SVM$byClass["Sensitivity"], 4), 
                    round(cm_SVM_2$byClass["Sensitivity"], 4),
                    round(cm_rf$byClass["Sensitivity"], 4), 
                    round(cm_KNN$byClass["Sensitivity"], 4), 
                    round(cm_ANN_3$byClass["Sensitivity"], 4), 
                    round(cm_stacked$byClass["Sensitivity"], 4)),
  "Kappa" = c(round(cm_lr$overall["Kappa"], 4), 
              round(cm_dt$overall["Kappa"], 4), 
              round(cm_SVM$overall["Kappa"], 4), 
              round(cm_SVM_2$overall["Kappa"], 4), 
              round(cm_rf$overall["Kappa"], 4), 
              round(cm_KNN$overall["Kappa"], 4), 
              round(cm_ANN_3$overall["Kappa"], 4), 
              round(cm_stacked$overall["Kappa"], 4)),
  "P-Value" = c(round(cm_lr$overall["AccuracyPValue"], 4), 
                round(cm_dt$overall["AccuracyPValue"], 4), 
                round(cm_SVM$overall["AccuracyPValue"], 4), 
                round(cm_SVM_2$overall["AccuracyPValue"], 4), 
                round(cm_rf$overall["AccuracyPValue"], 4), 
                round(cm_KNN$overall["AccuracyPValue"], 4), 
                round(cm_ANN_3$overall["AccuracyPValue"], 4), 
                round(cm_stacked$overall["AccuracyPValue"], 4))
)

kable(Model_Comparison, format = "markdown")
| Model | Accuracy | Sensitivity | Kappa | P.Value |
|:------|---------:|------------:|------:|--------:|
| Logistic Regression | 0.9629 | 0.5290 | 0.5756 | 0 |
| Decision Tree (weights) | 0.9631 | 0.8297 | 0.6790 | 0 |
| SVM | 0.9698 | 0.5435 | 0.6342 | 0 |
| SVM_2 | 0.5013 | 0.5507 | 0.0096 | 1 |
| Random Forest | 0.9821 | 0.6630 | 0.7832 | 0 |
| KNN | 0.9726 | 0.7283 | 0.7178 | 0 |
| ANN (ANN_3) | 0.9644 | 0.7138 | 0.6547 | 0 |
| Stacked Model | 0.9724 | 0.8358 | 0.7372 | 0 |
  • The Random Forest model has the highest accuracy
  • The Stacked model has the highest sensitivity
  • The Random Forest model has the highest kappa
  • The KNN model has the smallest p-value
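
The comparison table repeats the same four lookups for every model, so one possible simplification is a small helper that extracts them from any confusion-matrix object. The sketch below is hedged: mock_cm and cm_row are hypothetical helpers, and the mock objects only imitate the $overall and $byClass elements of a caret confusionMatrix result, filled with values printed earlier in this report.

```r
# Stand-ins shaped like caret::confusionMatrix() results (hypothetical helper)
mock_cm <- function(accuracy, kappa, p_value, sensitivity) {
  list(overall = c(Accuracy = accuracy, Kappa = kappa, AccuracyPValue = p_value),
       byClass = c(Sensitivity = sensitivity))
}

# Values copied from the confusion matrix outputs above
cms <- list(
  "Logistic Regression" = mock_cm(0.9629, 0.5756, 3.644e-07, 0.52899),
  "SVM"                 = mock_cm(0.9698, 0.6342, 1.968e-14, 0.54348),
  "Stacked Model"       = mock_cm(0.9724, 0.7372, 5.264e-09, 0.83582)
)

# One row of metrics per model; signif() keeps small p-values from rounding to 0
cm_row <- function(cm, digits = 4) {
  data.frame(
    Accuracy    = round(unname(cm$overall["Accuracy"]), digits),
    Sensitivity = round(unname(cm$byClass["Sensitivity"]), digits),
    Kappa       = round(unname(cm$overall["Kappa"]), digits),
    P.Value     = signif(unname(cm$overall["AccuracyPValue"]), 3)
  )
}

comparison <- do.call(rbind, lapply(cms, cm_row))
comparison <- cbind(Model = names(cms), comparison)
rownames(comparison) <- NULL
comparison
```

Using signif() for the p-value column is a deliberate choice here: rounding to four decimal places turns every small p-value into 0, which hides the differences between models.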

Stacked Model
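Comparing the stacked model to the individual models, the stacked model has the third highest accuracy (0.9724), behind the Random Forest and KNN models. It does have the highest sensitivity (0.8358). It has the second highest kappa (0.7372, just behind the Random Forest model at 0.7832). Lastly, while it does not have the smallest p-value, its p-value (5.264e-09) rounds to 0, showing that its accuracy is significantly better than the no-information rate. Note that the stacked model is evaluated on stacked_test, which is half of job_test (2,682 posts rather than 5,364), so its metrics rest on a smaller sample than those of the individual models.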

Step 7: Implement Model

Now that the models are created and evaluated, it is time to implement the models and see the financial impacts. It is important to also calculate the financial impact of having no model. I make assumptions (below) for the financial data I am missing.

Assumptions

  • There are no costs associated with true negatives and true positives
  • For false positives, these are job postings that our model flags as fraudulent that are actually real. As a result, the company would have to submit a request to get its posting unflagged and placed back on the site. This would cost a total of $35.
    • This would cost $20 in processing the unflagging request.
    • This would “cost” $15 in reputational harm to the job posting site, as companies do not want to post on sites where they have to worry about being falsely flagged as fraudulent.
  • For false negatives, these are job postings that our model does not flag as fraudulent but actually are. As a result, the site users who apply to this posting risk their time being wasted or, even worse, their information being stolen. This would cost a total of $500.
    • This would cost $100 in conducting a formal investigation to remove this post (after the fact) and then investigate any other fraudulent posts related to this one.
    • This would “cost” $400 in harm to the many hundreds of applicants who applied. If there were 500 applications, it would be about $0.80 per applicant.
  • Being able to identify fraudulent posts increases confidence in the site. As a result, more companies will post jobs and more people will use the site to find and apply to jobs. This is valued at fraud_identification_rate * $1,000,000.

Note: Results will be scaled up to 100,000 posts so that they are comparable.

# Assumptions
fp_cost <- 35
fn_cost <- 500
num_posts = 100000
nm_scalar = num_posts/nrow(job)
m_scalar = num_posts/nrow(job_test)
sm_scalar = num_posts/nrow(stacked_test)
bonus <- 1000000

No Model

nm_frad = sum(job$fraudulent) * nm_scalar
nm_total_cost = nm_frad * fn_cost

With no model, there would be no way of flagging fraudulent posts ahead of time. As a result, we end up treating all posts as non-fraudulent. Therefore, we miss 4,843 (all) fraudulent posts, costing $2,421,700.

Logistic Regression Model

lr_fp_cost = fp_cost * cm_lr$table["1", "0"] * m_scalar
lr_fn_cost = fn_cost * cm_lr$table["0", "1"] * m_scalar
lr_suc_rate = (cm_lr$table["1", "1"] * m_scalar) / nm_frad
lr_bonus = bonus * lr_suc_rate
lr_total_cost = lr_fp_cost + lr_fn_cost - lr_bonus

With the logistic regression model, we end up with 1,286 false positives, costing $45,022.37, and 2,424 false negatives, costing $1,211,782 (both scaled to 100,000 posts). This model has a fraud identification rate of 0.56, resulting in a benefit of $561,970.8. So, the total cost the job posting site incurs is $694,833.9.
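
Since the same five lines are repeated for every model, the calculation can be wrapped in one function. The sketch below is an illustration under stated assumptions: model_cost is a hypothetical helper, the counts are copied from the logistic regression confusion matrix, and the no-model fraud count uses the rounded figure of 4,843 quoted above, so the total differs from the reported one by a few dollars.

```r
# Hypothetical helper: scaled costs and benefit for one model's confusion-matrix counts
model_cost <- function(fp, fn, tp, n_test,
                       fraud_scaled,            # fraudulent posts per num_posts with no model
                       fp_cost = 35, fn_cost = 500,
                       bonus = 1e6, num_posts = 1e5) {
  scalar   <- num_posts / n_test                # scale the test set up to num_posts
  fp_total <- fp_cost * fp * scalar             # cost of unflagging legitimate posts
  fn_total <- fn_cost * fn * scalar             # cost of fraudulent posts that are missed
  id_rate  <- (tp * scalar) / fraud_scaled      # fraud identification rate
  c(fp_cost = fp_total,
    fn_cost = fn_total,
    id_rate = id_rate,
    benefit = bonus * id_rate,
    total   = fp_total + fn_total - bonus * id_rate)
}

# Logistic regression counts from cm_lr: 69 false positives, 130 false negatives, 146 true positives
lr_cost <- model_cost(fp = 69, fn = 130, tp = 146, n_test = 5364, fraud_scaled = 4843)
round(lr_cost, 2)
```

Keeping the cost assumptions as arguments also makes it easy to test how sensitive the ranking of models is to the $35 and $500 figures.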

Decision Tree Model

dt_fp_cost = fp_cost * cm_dt$table["1", "0"] * m_scalar
dt_fn_cost = fn_cost * cm_dt$table["0", "1"] * m_scalar
dt_suc_rate = (cm_dt$table["1", "1"] * m_scalar) / nm_frad
dt_bonus = bonus * dt_suc_rate
dt_total_cost = dt_fp_cost + dt_fn_cost - dt_bonus

With the decision tree model, we end up with 2,815 false positives (per 100,000 posts), costing $98,527.22, and 876 false negatives, costing $438,105.9. The model has a fraud identification success rate of 0.88, resulting in a benefit of $881,447.3. The total cost to the job posting site is therefore -$344,814.2, that is, a net benefit.

SVM Model

SVM_fp_cost = fp_cost * cm_SVM$table["1", "0"] * m_scalar
SVM_fn_cost = fn_cost * cm_SVM$table["0", "1"] * m_scalar
SVM_suc_rate = (cm_SVM$table["1", "1"] * m_scalar) / nm_frad
SVM_bonus = bonus * SVM_suc_rate
SVM_total_cost = SVM_fp_cost + SVM_fn_cost - SVM_bonus

With the SVM model, we end up with 671 false positives (per 100,000 posts), costing $23,489.93, and 2,349 false negatives, costing $1,174,497. The model has a fraud identification success rate of 0.58, resulting in a benefit of $577,367.2. The total cost to the job posting site is therefore $620,619.4.

SVM Model 2

SVM_fp_cost_2 = fp_cost * cm_SVM_2$table["1", "0"] * m_scalar
SVM_fn_cost_2 = fn_cost * cm_SVM_2$table["0", "1"] * m_scalar
SVM_suc_rate_2 = (cm_SVM_2$table["1", "1"] * m_scalar) / nm_frad
SVM_bonus_2 = bonus * SVM_suc_rate_2
SVM_total_cost_2 = SVM_fp_cost_2 + SVM_fn_cost_2 - SVM_bonus_2

With the second SVM model, we end up with 47,558 false positives (per 100,000 posts), costing $1,664,523, and 2,312 false negatives, costing $1,155,854. The model has a fraud identification success rate of 0.59, resulting in a benefit of $585,065.4. The total cost to the job posting site is therefore $2,235,311.

Random Forest

rf_fp_cost = fp_cost * cm_rf$table["1", "0"] * m_scalar
rf_fn_cost = fn_cost * cm_rf$table["0", "1"] * m_scalar
rf_suc_rate = (cm_rf$table["1", "1"] * m_scalar) / nm_frad
rf_bonus = bonus * rf_suc_rate
rf_total_cost = rf_fp_cost + rf_fn_cost - rf_bonus

With the random forest model, we end up with 56 false positives (per 100,000 posts), costing $1,957.49, and 1,734 false negatives, costing $866,890.4. The model has a fraud identification success rate of 0.70, resulting in a benefit of $704,388. The total cost to the job posting site is therefore $164,459.9.

KNN Model

KNN_fp_cost = fp_cost * cm_KNN$table["1", "0"] * m_scalar
KNN_fn_cost = fn_cost * cm_KNN$table["0", "1"] * m_scalar
KNN_suc_rate = (cm_KNN$table["1", "1"] * m_scalar) / nm_frad
KNN_bonus = bonus * KNN_suc_rate
KNN_total_cost = KNN_fp_cost + KNN_fn_cost - KNN_bonus

With the KNN model, we end up with 1,342 false positives (per 100,000 posts), costing $46,979.87, and 1,398 false negatives, costing $699,105.2. The model has a fraud identification success rate of 0.77, resulting in a benefit of $773,672.1. The total cost to the job posting site is therefore -$27,587.04, that is, a net benefit.

ANN Model

ANN_fp_cost = fp_cost * cm_ANN_3$table["1", "0"] * m_scalar
ANN_fn_cost = fn_cost * cm_ANN_3$table["0", "1"] * m_scalar
# Note: the success rate reads from cm_ANN, while the FP and FN costs above read from cm_ANN_3
ANN_suc_rate = (cm_ANN$table["1", "1"] * m_scalar) / nm_frad
ANN_bonus = bonus * ANN_suc_rate
ANN_total_cost = ANN_fp_cost + ANN_fn_cost - ANN_bonus

With the ANN model, we end up with 2,088 false positives (per 100,000 posts), costing $73,079.79, and 1,473 false negatives, costing $736,390.8. The model has a fraud identification success rate of 0.72, resulting in a benefit of $723,633.6. The total cost to the job posting site is therefore $85,836.98.

Stacked Model

stacked_fp_cost = fp_cost * cm_stacked$table["1", "0"] * sm_scalar
stacked_fn_cost = fn_cost * cm_stacked$table["0", "1"] * sm_scalar
stacked_suc_rate = (cm_stacked$table["1", "1"] * sm_scalar) / nm_frad
stacked_bonus = bonus * stacked_suc_rate
stacked_total_cost = stacked_fp_cost + stacked_fn_cost - stacked_bonus

With the stacked model, we end up with 1,939 false positives (per 100,000 posts), costing $67,859.81, and 820 false negatives, costing $410,141.7. The model has a fraud identification success rate of 0.86, resulting in a benefit of $862,201.7. The total cost to the job posting site is therefore -$384,200.2, that is, a net benefit.

Results

results <- data.frame(
  "Model" = c("No Model", "Logistic Regression", "Decision Tree", "SVM", "SVM_2", "Random Forest", "KNN", "ANN", "Stacked Model"),
  "Accuracy" = c(0, round(cm_lr$overall["Accuracy"], 4), 
                 round(cm_dt$overall["Accuracy"], 4), 
                 round(cm_SVM$overall["Accuracy"], 4), 
                 round(cm_SVM_2$overall["Accuracy"], 4), 
                 round(cm_rf$overall["Accuracy"], 4), 
                 round(cm_KNN$overall["Accuracy"], 4), 
                 round(cm_ANN$overall["Accuracy"], 4), 
                 round(cm_stacked$overall["Accuracy"], 4)),
  "Sensitivity" = c(0, round(cm_lr$byClass["Sensitivity"], 4), 
                    round(cm_dt$byClass["Sensitivity"], 4), 
                    round(cm_SVM$byClass["Sensitivity"], 4), 
                    round(cm_SVM_2$byClass["Sensitivity"], 4), 
                    round(cm_rf$byClass["Sensitivity"], 4), 
                    round(cm_KNN$byClass["Sensitivity"], 4), 
                    round(cm_ANN$byClass["Sensitivity"], 4), 
                    round(cm_stacked$byClass["Sensitivity"], 4))
)
results$FP_Cost <- c(0, lr_fp_cost, dt_fp_cost, SVM_fp_cost, SVM_fp_cost_2, rf_fp_cost, KNN_fp_cost, ANN_fp_cost, stacked_fp_cost)
results$FN_Cost <- c(nm_total_cost, lr_fn_cost, dt_fn_cost, SVM_fn_cost, SVM_fn_cost_2, rf_fn_cost, KNN_fn_cost, ANN_fn_cost, stacked_fn_cost)
results$Benefit <- c(0, lr_bonus, dt_bonus, SVM_bonus, SVM_bonus_2, rf_bonus, KNN_bonus, ANN_bonus, stacked_bonus)
results$Total_Cost <- c(nm_total_cost, lr_total_cost, dt_total_cost, SVM_total_cost, SVM_total_cost_2, rf_total_cost, KNN_total_cost, ANN_total_cost, stacked_total_cost)
results$Cost_Savings <- (nm_total_cost - results$Total_Cost)

results$FP_Cost <- format(round(results$FP_Cost, 2), big.mark = ",")
results$FN_Cost <- format(round(results$FN_Cost, 2), big.mark = ",")
results$Benefit <- format(round(results$Benefit, 2), big.mark = ",")
results$Total_Cost <- format(round(results$Total_Cost, 2), big.mark = ",")
results$Cost_Savings <- format(round(results$Cost_Savings, 2), big.mark = ",")

kable(results, format = "markdown", digits = 4)
| Model | Accuracy | Sensitivity | FP_Cost | FN_Cost | Benefit | Total_Cost | Cost_Savings |
|:--------------------|---------:|------------:|-------------:|------------:|----------:|-------------:|-------------:|
| No Model | 0.0000 | 0.0000 | 0.00 | 2,421,700.2 | 0.0 | 2,421,700.22 | 0.0 |
| Logistic Regression | 0.9629 | 0.5290 | 45,022.37 | 1,211,782.2 | 561,970.8 | 694,833.88 | 1,726,866.4 |
| Decision Tree | 0.9631 | 0.8297 | 98,527.22 | 438,105.9 | 881,447.3 | -344,814.16 | 2,766,514.4 |
| SVM | 0.9698 | 0.5435 | 23,489.93 | 1,174,496.6 | 577,367.2 | 620,619.37 | 1,801,080.9 |
| SVM_2 | 0.5013 | 0.5507 | 1,664,522.74 | 1,155,853.8 | 585,065.4 | 2,235,311.15 | 186,389.1 |
| Random Forest | 0.9821 | 0.6630 | 1,957.49 | 866,890.4 | 704,388.0 | 164,459.88 | 2,257,240.3 |
| KNN | 0.9726 | 0.7283 | 46,979.87 | 699,105.2 | 773,672.1 | -27,587.04 | 2,449,287.3 |
| ANN | 0.9700 | 0.6812 | 73,079.79 | 736,390.8 | 723,633.6 | 85,836.98 | 2,335,863.2 |
| Stacked Model | 0.9724 | 0.8358 | 67,859.81 | 410,141.7 | 862,201.7 | -384,200.20 | 2,805,900.4 |
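Because the cost columns of `results` are converted to formatted text before printing, they can no longer be sorted numerically. As a minimal sketch, the reported totals can be re-entered as numbers and ranked. The `totals` and `ranked` data frames are hypothetical names used only here, and the values are copied from the table above.

```r
# Total cost per model, copied from the Results table (per 100,000 posts)
totals <- data.frame(
  Model = c("No Model", "Logistic Regression", "Decision Tree", "SVM", "SVM_2",
            "Random Forest", "KNN", "ANN", "Stacked Model"),
  Total_Cost = c(2421700.22, 694833.88, -344814.16, 620619.37, 2235311.15,
                 164459.88, -27587.04, 85836.98, -384200.20)
)

# Rank from lowest to highest total cost; negative values are net benefits
ranked <- totals[order(totals$Total_Cost), ]

# Savings relative to running no model at all
baseline <- totals$Total_Cost[totals$Model == "No Model"]
ranked$Cost_Savings <- baseline - ranked$Total_Cost

print(ranked, row.names = FALSE)
```

On these figures the stacked model ranks first, followed by the decision tree and KNN, and these are the only three models with a negative total cost.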

Conclusion